Method For Sending Multimedia Bookmarks Over A Network

ABSTRACT

A method and system are provided for tagging, indexing, searching, retrieving, manipulating, and editing video images on a wide area network such as the Internet. A first set of methods is provided for enabling users to add bookmarks to multimedia files, such as movies, and audio files, such as music. The multimedia bookmark facilitates the searching of portions or segments of multimedia files, particularly when used in conjunction with a search engine. Additional methods are provided that reformat a video image for use on a variety of devices that have a wide range of resolutions by selecting some material (in the case of smaller resolutions) or more material (in the case of larger resolutions) from the same multimedia file. Still more methods are provided for interrogating images that contain textual information (in graphical form) so that the text may be copied to a tag or bookmark that can itself be indexed and searched to facilitate later retrieval via a search engine.

CROSS-REFERENCE TO RELATED APPLICATIONS

This application is a continuing application that is a divisional of commonly-owned, copending U.S. patent application Ser. No. 09/911,293, filed Jul. 23, 2001 by Sull et al.

BACKGROUND OF THE INVENTION

1. Field of the Invention

The present invention relates generally to marking multimedia files. More specifically, the present invention relates to applying or inserting tags into multimedia files for indexing and searching, as well as for editing portions of multimedia files, all to facilitate the storing, searching, and retrieving of the multimedia information.

2. Background of the Related Art

1. Multimedia Bookmarks

With the phenomenal growth of the Internet, the amount of multimedia content that can be accessed by the public has virtually exploded. There are occasions where a user who once accessed particular multimedia content needs or desires to access the content again at a later time, possibly at or from a different place. For example, in the case of data interruption due to a poor network condition, the user may be required to access the content again. In another case, a user who once viewed multimedia content at work may want to continue to view the content at home. Most users would want to restart accessing the content from the point where they had left off. Moreover, subsequent access may be initiated by a different user in an exchange of information between users. Unfortunately, multimedia content is represented in a streaming file format, so that a user has to view the file from the beginning in order to look for the exact point where the first user left off.

In order to save the time involved in browsing the data from the beginning, the concept of a bookmark may be used. A conventional bookmark marks a document such as a static web page for later retrieval by saving a link (address) to the document. For example, Internet browsers support a bookmark facility by saving an address called a Uniform Resource Identifier (URI) to a particular file. Internet Explorer, manufactured by the Microsoft Corporation of Redmond, Wash., uses the term “favorite” to describe a similar concept.

Conventional bookmarks, however, store only the information related to the location of a file, such as the directory name with a file name, a Universal Resource Locator (URL), or the URI. The files referred to by conventional bookmarks are treated in the same way regardless of the data formats for storing the content. Typically, a simple link is used for multimedia content also. For example, to link to a multimedia content file through the Internet, a URI is used. Each time the file is revisited using the bookmark, the multimedia content associated with the bookmark is always played from the beginning.

FIG. 1 illustrates a list 108 of conventional bookmarks 110, each comprising positional information 112 and a title 114. The positional information 112 of a conventional bookmark is composed of a URI as well as a bookmarked position 106. The bookmarked position is a relative time or byte position measured from the beginning of the multimedia content. The title 114 can be specified by a user, as well as delivered with the content, and it is typically used to help the user easily recognize the bookmarked URI in a bookmark list 108. For the case of a conventional bookmark without a bookmarked position, when a user wants to replay the specified multimedia file, the file is played from the beginning each time, regardless of how much of the file the user has already viewed. The user has no choice but to record the last accessed position in a memo and to move manually to the last stopped point. If the multimedia file is viewed by streaming, the user must go through a series of buffering steps to find the last accessed position, thus wasting much time. Even for a conventional bookmark with a bookmarked position, the same problem occurs when the multimedia content is delivered in a live broadcast, since the bookmarked position within the multimedia content is not usually available, as well as when the user wants to replay one of the variations of the bookmarked multimedia content.

Further, conventional bookmarks do not provide a convenient way of switching between different data formats. Multimedia content may be generated and stored in a variety of formats. For example, video may be stored in formats such as MPEG, ASF, RM, MOV, and AVI. Audio may be stored in formats such as MID, MP3, and WAV. There may be occasions where a user wants to switch the play of content from one format to another. Since different data formats produced from the same multimedia content are often encoded independently, the same segment is stored at different temporal positions within the different formats. Since conventional bookmarks have no facility to store any content information, users have no choice but to review the multimedia content from the beginning and to search manually for the last-accessed segment within the content.

Time information may be incorporated into a bookmark to return to the last-accessed segment within the multimedia content. The use of time information alone, however, fails to return to exactly the same segment at a later time for the following reasons. If a bookmark incorporating time information was used to save the last-accessed segment during a preview broadcast of multimedia content, the bookmark information would not be valid for returning to the last-accessed segment during a regular full-version broadcast. Similarly, if a bookmark incorporating time information was used to save the last-accessed segment during a real-time broadcast, the bookmark would not be effective during later access, because the later available version may have been edited or a time code was not available during the real-time broadcast.

In many video and audio archiving systems, several differently compressed files called “variations” may be produced from a single source multimedia content. Many web-casting sites provide multiple streaming files for a single video content with different bandwidths according to each video format. For example, CNN.com provides five different streaming videos for a single video content: two different types of streaming videos with the bandwidths of 28.8 kbps and 80 kbps, both encoded in Microsoft's Advanced Streaming Format (ASF). CNN.com also provides the RM streaming format by RealNetworks, Inc. of Seattle, Wash. (RM), and a streaming video with the smart bandwidth encoded in Apple Computer, Inc.'s QuickTime streaming format (MOV). In this case, the five video files may start and end at different time points from the viewpoint of the source video content, since each variation may be produced by an independent encoding process varying the values chosen for encoding formats, bandwidths, resolutions, etc. This results in mismatches of time points, because a specific time point of the source video content may be presented as different media time points in the five video files.

When a multimedia bookmark is utilized, these positional mismatches cause a problem of mis-positioned playback. Consider a simple case where one makes a multimedia bookmark on a master file of a multimedia content (for example, video encoded in a given format), and tries to play another variation (for example, video encoded in a different format) from the bookmarked position. If the two variations do not start at the same position of the source content, the playback will not start at the bookmarked position. That is, the playback will start at a position that is temporally shifted by the difference between the start positions of the two variations.
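By way of illustration only, the sketch below shows the arithmetic needed to compensate a bookmarked position for the differing start offsets of two variations. The variation names and offset values are hypothetical; the description above does not prescribe this particular representation.

```python
# Sketch: compensating a bookmarked position across variations.
# The variation names and start offsets below are hypothetical.

# Start offset of each variation, in seconds, measured on the source
# content's timeline (e.g., "asf_80k" begins 2.0 s into the source).
START_OFFSETS = {"asf_28k": 0.0, "asf_80k": 2.0, "mov_smart": 0.5}

def translate_position(bookmarked_pos, bookmarked_in, play_in):
    """Map a media time recorded in one variation onto another.

    The position is mapped back to the source timeline by adding the
    bookmarked variation's start offset, then into the target variation
    by subtracting that variation's start offset.
    """
    source_pos = bookmarked_pos + START_OFFSETS[bookmarked_in]
    return source_pos - START_OFFSETS[play_in]

# A bookmark made 120 s into "asf_28k" maps to 118 s in "asf_80k".
print(translate_position(120.0, "asf_28k", "asf_80k"))  # 118.0
```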

An entire multimedia presentation is often lengthy, and there are frequent occasions when the presentation is interrupted, voluntarily or forcibly, before it finishes. Examples include a user who starts playing a video at work, leaves the office, and desires to continue watching the video at home, or a user who is forced to stop watching the video and log out due to a system shutdown. It is thus necessary to save the termination position of the multimedia file into persistent storage in order to return directly to the point of termination without a time-consuming playback of the multimedia file from the beginning.

The interrupted presentation of the multimedia file will usually resume exactly at the previously saved termination position. However, in some cases, it is desirable to begin the playback of the multimedia file a certain time before the termination point, since such rewinding could help refresh the user's memory.

In the prior art, the EPG (Electronic Program Guide) has played a crucial role as a provider of TV programming information. The EPG facilitates a user's efforts to search for TV programs that he or she wants to view. However, the EPG's two-dimensional presentation (channels vs. time slots) becomes cumbersome as terrestrial, cable, and satellite systems send out thousands of programs through hundreds of channels. Navigation through a large table of rows and columns in order to search for desired programs is frustrating.

One of the features provided by recent set-top boxes (STBs) is personal video recording (PVR), which allows simultaneous recording and playback. Such an STB usually contains a digital video encoder/decoder based on an international digital video compression standard such as MPEG-1/2, as well as large local storage for the digitally compressed video data. Some of the recent STBs also allow connection to the Internet. Thus, STB users can experience new services such as time-shifting and web-enhanced television (TV).

However, there still exist some problems for the PVR-enabled STBs. The first problem is that even the latest STBs alone cannot fully satisfy users' ever-increasing desire for diverse functionalities. The STBs now on the market are very limited in terms of computing power and memory, so it is not easy to execute most CPU- and memory-intensive applications. For example, people who are bored with plain playback of the recorded video may desire more advanced features such as video browsing/summary and search. All of those features require metadata for the recorded video. The metadata are usually data describing content, such as the title, genre, and summary of a television program. The metadata also include audiovisual characteristic data such as raw image data corresponding to a specific frame of the video stream. Some of the description is structured around “segments” that represent spatial, temporal, or spatio-temporal components of the audio-visual content. In the case of video content, a segment may be a single frame, a single shot consisting of successive frames, or a group of several successive shots. Each segment may be described by some elementary semantic information using text. A segment is referenced by the metadata using media locators such as frame numbers or time codes. However, the generation of such video metadata usually requires intensive computation and a human operator's help, so, practically speaking, it is not feasible to generate the metadata in the current STB. Thus, one possible solution for this problem is to generate the metadata in a server connected to the STB and to deliver it to the STB via a network. However, in this scenario, it is essential to know the start position of the recorded video with respect to the video stream used to generate the metadata in the server/content provider, in order to match the temporal position referenced by the metadata to the position of the recorded video.

The second problem is related to the discrepancy between two time instants: the time instant at which the STB starts the recording of the user-requested TV program, and the time instant at which the TV program is actually broadcast. Suppose, for instance, that a user initiated a PVR request for a TV program scheduled to go on the air at 11:30 AM, but the actual broadcasting time is 11:31 AM. In this case, when the user wants to play the recorded program, the user has to watch an unwanted segment at the beginning of the recorded video, which lasts for one minute. This time mismatch could bring some inconvenience to the user who wants to view only the requested program. However, the time mismatch problem can be solved by using metadata delivered from the server, for example, reference frames/segments representing the beginning of the TV program. The exact location of the TV program can then be easily found by simply matching the reference frames against the recorded frames for the program.
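A minimal sketch of the frame-matching step just described, assuming the recorded video and the server-delivered reference frame are available as grayscale numpy arrays. The mean-absolute-difference comparison and the threshold value are illustrative assumptions, not a prescribed matching method.

```python
import numpy as np

def find_program_start(recorded_frames, reference_frame, threshold=8.0):
    """Locate the first recorded frame that matches the reference frame.

    recorded_frames: iterable of grayscale frames (2-D numpy arrays).
    reference_frame: server-delivered frame marking the program start.
    Returns the index of the first frame whose mean absolute pixel
    difference from the reference falls below the threshold, or None.
    """
    ref = reference_frame.astype(np.float32)
    for i, frame in enumerate(recorded_frames):
        if np.mean(np.abs(frame.astype(np.float32) - ref)) < threshold:
            return i
    return None
```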

2. Search

The rapid expansion of the World Wide Web (WWW) and mobile communications has also brought great interest in efficient multimedia data search, browsing, and management. Content-based image retrieval (CBIR) is a powerful concept for finding images based on image contents, and content-based image search and browsing have been tested using many CBIR systems. See, M. Flickner, Harpreet Sawhney, Wayne Niblack, Jonathan Ashley, Q. Huang, Byron Dom, Monika Gorkani, Jim Hafner, Denis Lee, Dragutin Petkovic, David Steele and Peter Yanker, “Query by image and video content: The QBIC system,” IEEE Computer, Vol. 28, No. 9, pp. 23-32, September, 1995; Carson, Chad et al., “Region-Based Image Querying [Blobworld],” Workshop on Content-Based Access of Image and Video Libraries, Puerto Rico, June, 1997; J. R. Smith and S. Chang, “Visually searching the web for content,” IEEE Multimedia Magazine, Vol. 4, No. 3, pp. 12-20, Summer 1997, also Columbia U. CU/CTR Technical Report 459-96-25; A. Pentland, R. W. Picard and S. Sclaroff, “Photobook: tools for content-based manipulation of image databases,” in Proc. of SPIE Conf. on Storage and Retrieval for Image and Video Database-II, No. 2185, pp. 34-47, San Jose, Calif., February, 1994; J. R. Bach, C. Fuller, A. Guppy, A. Hampapur, B. Horowitz, R. Humphrey, R. C. Jain and C. Shu, “Virage image search engine: an open framework for image management,” Symposium on Electronic Imaging: Science and Technology—Storage & Retrieval for Image and Video Databases IV, IS&T/SPIE '96, February, 1996; J. R. Smith and S. Chang, “VisualSEEk: A Fully Automated Content-Based Image Query System,” ACM Multimedia Conference, Boston, Mass., November, 1996; Jing Huang, S. Ravi Kumar, Mandar Mitra, Wei-Jing Zhu and Ramin Zabih, “Image Indexing Using Color Correlograms,” in IEEE Conference on Computer Vision and Pattern Recognition, pp. 762-768, June, 1997; and Simone Santini and Ramesh Jain, “The ‘El Nino’ Image Database System,” in International Conference on Multimedia Computing and Systems, pp. 524-529, June, 1999.

Currently, most content-based image search engines rely on low-level image features such as color, texture, and shape. While high-level image descriptors are potentially more intuitive for common users, the derivation of high-level descriptors is still in its experimental stages in the field of computer vision and requires complex vision processing. Despite their efficiency and ease of implementation, on the other hand, the main disadvantage of low-level image features is that they are perceptually non-intuitive for both expert and non-expert users, and therefore do not normally represent users' intent effectively. Furthermore, they are highly sensitive to small amounts of image variation in feature shape, size, position, orientation, brightness, and color. Perceptually similar images are often highly dissimilar in terms of low-level image features. Searches made by low-level features are often unsuccessful, and it usually takes many trials to find images satisfactory to a user.

Efforts have been made to overcome the limitations of low-level features. Relevance feedback is a popular idea for incorporating a user's perceptual feedback into the image search. See, Y. Rui, T. Huang, and S. Mehrotra, “A relevance feedback architecture in content-based multimedia information retrieval systems,” in IEEE Workshop on Content-based Access of Image and Video Libraries, Puerto Rico, pp. 82-89, June, 1997; Yong Rui, Thomas S. Huang, Michael Ortega, and Sharad Mehrotra, “Relevance Feedback: A Power Tool in Interactive Content-Based Image Retrieval,” in IEEE Trans. on Circuits and Systems for Video Technology, Special Issue on Segmentation, Description, and Retrieval of Video Content, pp. 644-655, Vol. 8, No. 5, September, 1998; G. Aggarwal, P. Dubey, S. Ghosal, A. Kulshreshtha, and A. Sarkar, “iPURE: perceptual and user-friendly retrieval of images,” in Proc. of IEEE International Conference on Multimedia and Exposition, Vol. 2, pp. 693-696, July, 2000; Ye Lu, Chunhui Hu, Xingquan Zhu, HongJiang Zhang and Qiang Yang, “A unified framework for semantics and feature based relevance feedback in image retrieval systems,” in Proc. of ACM International Conference on Multimedia, pp. 31-37, October, 2000; H. Muller, W. Muller, S. Marchand-Maillet, and T. Pun, “Strategies for positive and negative relevance feedback in image retrieval,” in Proc. of IEEE Conference on Pattern Recognition, Vol. 1, pp. 1043-1046, September, 2000; S. Aksoy, R. M. Haralick, F. A. Cheikh, and M. Gabbouj, “A weighted distance approach to relevance feedback,” in Proc. of IEEE Conference on Pattern Recognition, Vol. 4, pp. 812-815, September, 2000; I. J. Cox, M. L. Miller, T. P. Minka, T. V. Papathomas, and P. N. Yianilos, “The Bayesian image retrieval system, PicHunter: theory, implementation, and psychophysical experiments,” in IEEE Transactions on Image Processing, Vol. 9, pp. 20-37, January, 2000; P. Muneesawang and Guan Ling, “Multi-resolution-histogram indexing and relevance feedback learning for image retrieval,” in Proc. of IEEE International Conference on Image Processing, Vol. 2, pp. 526-529, January, 2001. A user can manually establish relevance between a query and retrieved images, and the relevant images can be used for refining the query. When the refinement is made by adjusting a set of low-level feature weights, however, the user's intent is still represented by low-level features, and their basic limitations still remain.

Several approaches have been made to the integration of human perceptual responses and low-level features in image retrieval. One notable approach is to adjust an image feature's distance attributes based on the human perceptual input. See, Simone Santini and Ramesh Jain, “The ‘El Nino’ Image Database System,” in International Conference on Multimedia Computing and Systems, pp. 524-529, June, 1999. Another approach, called “blob world,” combines low-level features to derive slightly higher-level descriptions and presents the “blobs” of grouped features to a user to provide a better understanding of feature characteristics. See, Carson, Chad, et al., “Region-Based Image Querying [Blobworld],” Workshop on Content-Based Access of Image and Video Libraries, Puerto Rico, June, 1997. While those schemes successfully reflect a user's intent to some degree, it remains to be seen how grouping of features or feature distance modification can achieve perceptual relevance in image retrieval. A more traditional computer vision approach to the derivation of high-level object descriptors based on generic object recognition has been presented for image retrieval. See, David A. Forsyth and Margaret Fleck, “Body Plans,” in IEEE Conference on Computer Vision and Pattern Recognition, pp. 678-683, June, 1997. Due to its limited feasibility for general image objects and its complex processing, its utility is still restricted.

With the rapid proliferation of large image/video databases, there has been an increasing demand for effective methods to search the large image/video databases automatically by their content. For a query image/video clip given by a user, these methods search the databases for the images/videos that are most similar to the query. In other words, the goal of the image/video search is to find the best matches to the query image/video from the database.

Several approaches have been made towards the development of fast, effective multimedia search methods. Milanese et al. utilized hierarchical clustering to organize an image database into visually similar groupings. See, R. Milanese, D. Squire, and T. Pun, “Correspondence analysis and hierarchical indexing for content-based image retrieval,” in Proc. IEEE Int. Conf. Image Processing, Vol. 3, Lausanne, Switzerland, pp. 859-862, September, 1996. Zhang and Zhong provided a hierarchical self-organizing map (HSOM) method to organize an image database into a two-dimensional grid. See, H. J. Zhang and D. Zhong, “A scheme for visual feature based image indexing,” in Proc. SPIE/IS&T Conf. Storage Retrieval Image Video Database III, Vol. 2420, pp. 36-46, San Jose, Calif., February, 1995. However, a weakness of HSOM is that it is generally too computationally expensive to apply to a large multimedia database.

In addition, there are other well-known solutions using Voronoi diagrams, Kd-trees, and R-trees. See, J. Bentley, “Multidimensional binary search trees used for associative searching,” Comm. of the ACM, Vol. 18, No. 9, pp. 509-517, 1975; S. Brin, “Near neighbor search in large metric spaces,” in Proc. 21st Conf. on Very Large Databases (VLDB '95), Zurich, Switzerland, pp. 574-584, 1995. However, it is also known that those approaches are not adequate for high-dimensional feature vector spaces, and thus they are useful only in low-dimensional feature spaces.

Peer-to-Peer Searching

Peer-to-Peer (P2P) is a class of applications making the most of previously unused resources (for example, storage, content, and/or CPU cycles), which are available on the peers at the edges of networks. P2P computing allows the peers to share the resources and services, or to aggregate CPU cycles, or to chat with each other, by direct exchange. Two of the more popular implementations of P2P computing are Napster and Gnutella. Napster has its peers register files with a broker, and uses the broker to search for files to copy. The broker plays the role of server in a client-server model to facilitate the interaction between the peers. Gnutella has peers register files with network neighbors, and searches the P2P network for files to copy. Since this model does not require a centralized broker, Gnutella is considered to be a true P2P system.

3. Editing

In the prior art, video files were edited through video editing software by copying several segments of the input videos and pasting them to an output video. The prior art method, however, confronts the two major problems mentioned below.

The first problem of the prior art method is that it requires additional storage to store the new version of an edited video file. Conventional video editing software generally uses the original input video file to create an edited video. In most cases, editors having a large database of videos attempt to edit the videos to create a new one. In this case, storage is wasted storing duplicated portions of the video.

The second problem with the prior art method is that whole new metadata have to be generated for a newly created video. If the metadata are not edited in accordance with the editing of the video, the metadata may not accurately reflect the content, even if the metadata for the specific segments of the input video were already constructed. Because considerable effort is required to create the metadata of videos, it is desirable to reuse existing metadata efficiently, if possible.

Metadata of a video segment contain textual information such as time information (for example, starting frame number and duration, or starting frame number as well as finishing frame number), title, keywords, and annotation, as well as image information such as the key frame of a segment. The metadata of segments can form a hierarchical structure where a larger segment contains smaller segments. Because it is hard to store both the video and its metadata in a single file, the video metadata are separately stored as a metafile, or stored in a database management system (DBMS).

If metadata having a hierarchical structure are used, browsing a whole video, searching for a segment using the keywords and annotation of each segment, and using the key frames of each segment for a visual summary of the video are all supported. Such metadata support not only simple playback but also the playback, and repeated playback, of a specific segment. Therefore, the use of hierarchically-structured metadata is becoming popular.
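A minimal sketch of hierarchically structured segment metadata and a keyword search over it, using the fields named above (time information, title, keywords, key frame). The dataclass representation is an illustrative assumption; an actual metafile might be XML or rows in a DBMS.

```python
from dataclasses import dataclass, field

@dataclass
class Segment:
    start_frame: int
    duration: int                                 # in frames
    title: str = ""
    keywords: list = field(default_factory=list)
    key_frame: str = ""                           # URI of the key frame image
    children: list = field(default_factory=list)  # smaller contained segments

def find_by_keyword(segment, keyword):
    """Depth-first search for segments annotated with a given keyword."""
    hits = [segment] if keyword in segment.keywords else []
    for child in segment.children:
        hits.extend(find_by_keyword(child, keyword))
    return hits

video = Segment(0, 9000, "Whole video", children=[
    Segment(0, 3000, "Opening", keywords=["anchor"]),
    Segment(3000, 6000, "Main story", keywords=["interview"]),
])
print([s.title for s in find_by_keyword(video, "interview")])  # ['Main story']
```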

4. Transcoding

With the advance of information technology, such as the popularity of the Internet, multimedia presentation proliferates into ever-increasing kinds of media, including wireless media. Multimedia data are accessed by ever-increasing kinds of devices such as hand-held computers (HHCs), personal digital assistants (PDAs), and smart cellular phones. There is a need for accessing multimedia content in a universal fashion from a wide variety of devices. See, J. R. Smith, R. Mohan and C. Li, “Transcoding Internet Content for Heterogeneous Client Devices,” in Proc. ISCAS, Monterey, Calif., 1998.

Several approaches have been made to effectively enable such universal multimedia access (UMA). A data representation, the InfoPyramid, is a framework for aggregating the individual components of multimedia content with content descriptions, and methods and rules for handling the content and content descriptions. See, C. Li, R. Mohan and J. R. Smith, “Multimedia Content Description in the InfoPyramid,” in Proc. IEEE Intern. Conf. on Acoustics, Speech and Signal Processing, May, 1998. The InfoPyramid describes content in different modalities, at different resolutions, and at multiple abstractions. A transcoding tool then dynamically selects the resolutions or modalities that best meet the client capabilities from the InfoPyramid. J. R. Smith proposed a notion of importance value for each of the regions of an image as a hint to reduce the overall data size in bits of the transcoded image. See, J. R. Smith, R. Mohan and C. Li, “Content-based Transcoding of Images in the Internet,” in Proc. IEEE Intern. Conf. on Image Processing, October, 1998; S. Paek and J. R. Smith, “Detecting Image Purpose in World-Wide Web Documents,” in Proc. SPIE/IS&T Photonics West, Document Recognition, January, 1998. The importance value describes the relative importance of a region/block in the image presentation compared with the other regions. This value ranges from 0 to 1, where 1 stands for the most important region and 0 for the least. For example, the regions of high importance are compressed with a lower compression factor than the remaining part of the image. The other parts of the image are then first blurred and compressed with a higher compression factor in order to reduce the overall data size of the compressed image.
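To make the importance-value hint concrete, the sketch below maps per-region importance values (0 to 1, as described above) to compression factors. The linear mapping and the factor range are illustrative assumptions, not the cited papers' formulas.

```python
def compression_factor(importance, strongest=50.0, weakest=10.0):
    """Map an importance value in [0, 1] to a compression factor.

    Higher importance yields a lower compression factor (better
    fidelity); the linear mapping and factor range are illustrative.
    """
    return strongest - importance * (strongest - weakest)

regions = {"headline_text": 1.0, "face": 0.8, "background": 0.1}
for name, value in regions.items():
    print(name, round(compression_factor(value), 1))
# headline_text 10.0, face 18.0, background 46.0
```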

When an image is transmitted to a variety of client devices with different display sizes, a scaling mechanism, such as format/resolution change, bit-wise data size reduction, and object dropping, is needed. More specifically, when an image is transmitted to a variety of client devices with different display sizes, a system should generate a transcoded (e.g., scaled and cropped) image to fit the size of the respective client display. The extent of transcoding depends on the type of objects embedded in the image, such as cards, bridges, faces, and so forth. Consider, for example, an image containing embedded text or a human face. If the display size of a client device is smaller than the size of the image, the spatial resolution of the image must be reduced by sub-sampling and/or cropping to fit the client display. In such a case, users very often have difficulty recognizing the text or the human face due to the excessive resolution reduction. Although the importance value may be used to provide information on which part of the image can be cropped, it does not provide a quantified measure of perceptibility indicating the degree of allowable transcoding. For example, the prior art does not provide quantitative information on the allowable compression factor with which the important regions can be compressed while preserving the minimum fidelity that an author or a publisher intended. The InfoPyramid neither provides quantitative information about how much the spatial resolution of the image can be reduced nor ensures that the user will perceive the transcoded image as the author or publisher initially intended.

5. Visual Rhythm

Fast Construction of Visual Rhythm

Once digital video is indexed, more manageable and efficient forms of retrieval may be developed based on the index. Generally, the first step in indexing and retrieving visual data is to temporally segment the input video, that is, to find shot boundaries due to camera shot transitions. The temporally segmented shots can improve the storing and retrieving of visual data if keywords for the shots are also available. Therefore, a fast and accurate automatic shot detector needs to be developed, as well as an automatic text caption detector to automatically annotate keywords to the temporally segmented shots.

Even if abrupt scene changes are relatively easy to detect, it is more difficult to identify special effects such as dissolves and wipes. Unfortunately, these special effects are normally used to stress the importance of the scene change (from a content point of view), so they are extremely relevant and therefore should not be missed. However, wipe sequence detection, relative to dissolve detection, has been less discussed. For scene change detection, a matching process between two consecutive frames is required. In order to segment a video sequence into shots, a dissimilarity measure between two frames must be defined. This measure must return a high value only when two frames fall in different shots. Several researchers have used dissimilarity measures based on the luminance or color histogram, correlogram, or other visual features to match two frames. However, these approaches usually produce many false alarms, and it is very hard for humans to locate exactly the various types of shots (especially dissolves and wipes) of a given video, even when the dissimilarity measure between two frames is plotted, for example in a 1-D graph where the horizontal axis represents time of a video sequence and the vertical axis represents the dissimilarity values between the histograms of the frames along time. These approaches also require a high computation load to handle the different shapes, directions, and patterns of various wipe effects. Therefore, it is important to develop a tool that enables a human operator to efficiently verify the results of automatic shot detection, where there usually might be many falsely detected and missing shots. Visual rhythm satisfies much of the above conditions. Visual rhythm contains distinctive patterns or visual features for many types of video editing effects, especially for all wipe-like effects, which manifest as visually distinguishable lines or curves on the visual rhythm. Because the visual rhythm can be computed in very little time, it enables easy verification of automatically detected shots by a human, without actually playing the whole frame sequence, so as to minimize or possibly eliminate all false as well as missing shots. Visual rhythm also contains visual features readily available to detect caption text. See, H. Kim, J. Lee and S. M. Song, “An efficient graphical shot verifier incorporating visual rhythm,” in Proceedings of IEEE International Conference on Multimedia Computing and Systems, pp. 827-834, June, 1999.
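A minimal sketch of visual rhythm construction, assuming decoded grayscale frames are available as numpy arrays. Sampling each frame along its main diagonal is one common choice; the particular sampling line is an assumption, not mandated by the discussion above.

```python
import numpy as np

def visual_rhythm(frames):
    """Build a visual rhythm image by sampling each frame's main diagonal.

    frames: iterable of equal-sized grayscale frames (2-D numpy arrays).
    Column t of the result is the diagonal of frame t, so shot cuts show
    up as vertical discontinuities and wipes as slanted lines or curves.
    """
    columns = []
    for frame in frames:
        h, w = frame.shape
        rows = np.arange(h)         # one sample per row
        cols = (rows * w) // h      # walk the main diagonal
        columns.append(frame[rows, cols])
    return np.stack(columns, axis=1)
```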

Detecting Text in Video and Graphic Images

As contents become readily available on wide area networks such as the Internet, archiving, searching, indexing, and locating desired content in large volumes of multimedia containing image and video, in addition to text information, will become even more difficult. One important source of information about image and video is the text contained therein. The video can be easily indexed if access to this textual information content is available. The text provides clear semantics of video and is extremely useful in deducing the contents of video.

There are many methods that segment and recognize text in printed documents. Current video research tackles the text caption recognition problem as a series of sub-problems: (a) identify the existence and location of text captions in a complex background; (b) segment text regions; and (c) post-process the text regions for recognition using a standard OCR. Most current research focuses on tackling sub-problems (a) and (b) in the raw spatial domain, with a few methods that can be extended to compressed domain processing.

A large number of methods has been studied extensively in recent years to detect text frames in uncompressed images and video. Ohya et al. performed character extraction through local thresholding and detected character candidate regions by evaluating gray level differences between adjacent regions. See, J. Ohya, A. Shio and S. Akamatsu, “Recognizing Characters in Scene Images,” in IEEE Trans. on Pattern Analysis and Machine Intelligence, Vol. 16, pp. 214-224. Hauptmann and Smith used the spatial context of text and the high contrast of text regions in scene images to merge large numbers of horizontal and vertical edges in spatial proximity to detect text. See, A. Hauptmann, M. Smith, “Text, Speech, and Vision for Video Segmentation: The Informedia Project,” in AAAI Symposium on Computational Models for Integrating Language and Vision, 1995. Shim et al. introduced a generalized region labeling algorithm to find homogeneous regions for text extraction. See, J. Shim, C. Dorai and M. Smith, “Automatic Text Extraction from Video for Content-Based Annotation and Retrieval,” in Proc. ICPR, pp. 618-620, 1998. Manmatha presented an algorithm to detect and segment text as regions of distinctive texture, using a pyramid technique to handle text fonts of different sizes. See, R. Manmatha, “Finding Text in Images,” in Proc. of ACM Int'l Conf. on Digital Libraries, pp. 3-12. Lienhart and Stuber provided a split-and-merge algorithm based on characteristics of artificial text to segment text. See, R. Lienhart, “Automatic Text Recognition for Video Indexing,” in Proc. of ACM MM, pp. 11-20. Doermann and Kia used wavelet analysis and employed a multi-frame coherence approach to cluster edges into rectangular shapes. See, D. Doermann, O. Kia, “Automatic Text Detection and Tracking in Digital Video,” in IEEE Trans. on Image Processing, Vol. 9, pp. 147-156. Sato et al. adopted a multi-frame integration technique to separate static text from moving background. See, T. Sato, T. Kanade and S. Satoh, “Video OCR: Indexing Digital News Libraries by Recognition of Superimposed Captions,” in Multimedia Systems, Vol. 7, pp. 385-394.

Finally, several compressed domain methods have also been proposed to detect text regions. Yeo and Liu proposed a method for the detection of text caption events in video by modified scene change detection, which cannot handle captions that gradually enter or disappear from frames. See, B. L. Yeo, “Visual Content Highlighting via Automatic Extraction of Embedded Captions on MPEG Compressed Video,” in SPIE/IS&T Symp. on Electronic Imaging Science and Technology, Vol. 2668, 1996. Zhong et al. examined the horizontal variations of AC values in the DCT to locate text frames and examined the vertical intensity variation within the text regions to extract the final text frames. See, Y. Zhong, K. Karu and A. Jain, “Automatic caption localization in compressed video,” in IEEE Trans. on PAMI, 22 (4), pp. 385-392. Zhong derived a binarized gradient energy representation directly from DCT coefficients, subject to constraints on text properties and temporal coherence, to locate text. See, Y. Zhong, “Detection of text captions in compressed domain video,” in Proc. of Multimedia Information Retrieval Workshop, ACM Multimedia 2000, pp. 201-204, November, 2000. However, most of the compressed domain methods restrict the detection of text to I-frames of a video, because it is time-consuming to obtain the AC values in the DCT for inter-frame coded frames.

There is, therefore, a need in the art for a method and system that will enable the tagging of multimedia images for indexing, editing, searching, and retrieving. There is also a need in the art to enable the indexing of textual information that is embedded in graphical images or other multimedia data, so that the text in the image can also be tagged, indexed, searched, and retrieved, as is other textual information. Further, there is also a need in the art for editing multimedia data for display, indexing, and searching in ways the prior art does not provide.

SUMMARY OF THE INVENTION

The invention overcomes the above-identified problems, as well as other shortcomings and deficiencies of existing technologies, by providing the following.

1. Multimedia Bookmark

The present invention provides a system and method for accessing multimedia content stored in a multimedia file having a beginning and an intermediate point, the content having at least one segment at the intermediate point. At a minimum, the system includes a multimedia bookmark, the multimedia bookmark having content information about the segment at the intermediate point, wherein a user can utilize the multimedia bookmark to access the segment without accessing the beginning of the multimedia file.

The system of the present invention can include a wide area network such as the Internet. Moreover, the method of the present invention can facilitate the creating, storing, indexing, searching, retrieving, and rendering of multimedia content on any device capable of connecting to the network and performing one or more of the aforementioned functions. The multimedia content can be one or more frames of video, audio data, text data such as a string of characters, or any combination or permutation thereof.

The system of the present invention includes a search mechanism that locates a segment in the multimedia file. An access mechanism is included in the system that reads the multimedia content at the segment designated by the multimedia bookmark. The multimedia content can be partial data that are related to a particular segment.

The multimedia bookmark used in conjunction with the system of the present invention includes positional information about the segment. The positional information can be a URI, an elapsed time, a time code, or other information. While the multimedia file used in conjunction with the system of the present invention can be contained on local storage, it can also be stored at remote locations.
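A minimal sketch of a bookmark record combining the positional information described above with content information about the bookmarked segment. The field names and types are illustrative assumptions, not a prescribed schema.

```python
from dataclasses import dataclass

@dataclass
class MultimediaBookmark:
    uri: str          # location of the multimedia file
    position: float   # elapsed time in seconds, or a time code
    title: str        # user-supplied or content-delivered label
    key_frame: bytes  # content information, e.g., a thumbnail of the
                      # bookmarked segment, usable for matching when the
                      # position alone is unreliable

bm = MultimediaBookmark("http://example.com/video.asf", 312.5,
                        "News segment", b"")
```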

The system of the present invention can be a computer server that is operably connected to a network to which one or more client devices are connected. Local storage on the server can optionally include a database and sufficient circuitry and/or logic, in the form of hardware and/or software in any combination, that facilitates the storing, indexing, searching, retrieving, and/or rendering of multimedia information.

The present invention further provides a methodology and implementation for adaptive refresh rewinding, as opposed to traditional rewinding, which simply performs a rewind from a particular position by a predetermined length. For simplicity, the exemplary embodiment described below will demonstrate the present invention using video data. Three essential parameters are identified to control the behavior of adaptive refresh rewinding, that is, how far to rewind, how to select certain frames in the rewind interval, and how to present the chosen refresh video frames on a display device.
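A minimal sketch making the three parameters above concrete: a rewind length, frame selection within the rewind interval, and a list of time points a player could present. The fixed-interval, uniform-sampling policy is an illustrative assumption; adaptive refresh rewinding as described here would choose these values adaptively.

```python
def refresh_frames(stop_time, rewind_seconds=30.0, num_refresh=5):
    """Return time points of refresh frames preceding a stop position.

    Selects num_refresh uniformly spaced points in the interval
    [stop_time - rewind_seconds, stop_time); a player could present
    these as a filmstrip before resuming playback at stop_time.
    """
    start = max(0.0, stop_time - rewind_seconds)
    step = (stop_time - start) / num_refresh
    return [start + i * step for i in range(num_refresh)]

print(refresh_frames(120.0))  # [90.0, 96.0, 102.0, 108.0, 114.0]
```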

The present invention also provides a new way to generate and deliver programming information that is customized to the user's viewing preferences. This embodiment of the present invention removes the navigational difficulties associated with the EPG. Specifically, data regarding the user's habits of recording, scheduling, and/or accessing TV programs or Internet movies are captured and stored. Over a long period of time, these data can be analyzed and used to determine the user's trends or patterns, which can be used to predict future viewing preferences.

The present invention also relates to techniques to solve the two problems by downloading the metadata from a distant metadata server and then synchronizing/matching the content with the received metadata. While this invention is described in the context of video content stored on an STB having a PVR function, it can be extended to other multimedia content such as audio.

The present invention also allows the reuse of content prerecorded on analog VCR videotapes. Once the content of a VCR tape is converted into digital video using the PVR function of the STB and stored on the STB's hard disk, the present invention works equally well.

The present invention also provides a method for searching for relevant multimedia content based on at least one feature saved in a multimedia bookmark. The method preferably includes transmitting at least one feature saved in a multimedia bookmark from a client system to a server system in response to a user's selection of the multimedia bookmark. The server may then generate a query for each feature received and, subsequently, use each query generated to search one or more storage devices. The search results may be presented to the user upon completion.

In yet another embodiment, the present invention provides a method for verifying inclusion of attachments to electronic mail messages. The method preferably includes scanning the electronic mail message for at least one indicator of an attachment to be included and determining whether at least one attachment to the electronic mail message is present upon detection of the at least one indicator. In the event an indicator is present but an attachment is not, the method preferably also includes displaying a reminder to a user that no attachment is present.

In yet another embodiment, the present invention provides a method for searching for multimedia content in a peer-to-peer environment. The method preferably includes broadcasting a message from a user system to announce its entrance to the peer-to-peer environment. Active nodes in the peer-to-peer environment preferably acknowledge receipt of the broadcast message while the user system preferably tracks the active nodes. Upon initiation of a search request at the user system, a query message including multimedia features is preferably broadcast to the peer-to-peer environment. Upon receipt of the query message, a multimedia search engine on a multimedia database included in a storage device on one or more active nodes is preferably executed. A search results message including a listing of found filenames and network locations is preferably sent to the user system upon completion of the database search.

The present invention further provides a method for sending a multimedia bookmark between devices over a wireless network. The method preferably includes acknowledging receipt of a multimedia bookmark by a video bookmark message service center upon receipt of the multimedia bookmark from a sending device. After requesting and receiving routing information from a home location register, the video bookmark message service center preferably invokes a send multimedia bookmark operation at a mobile switching center. The mobile switching center then preferably sends the multimedia bookmark and, upon acknowledgement of receipt of the multimedia bookmark by the recipient device, notifies the video bookmark message service center of the completed multimedia bookmark transaction.

In another embodiment, the present invention provides a method for sending multimedia content over a wireless network for playback on a mobile device. In this embodiment, the mobile device preferably sends a multimedia bookmark and a request for playback to a mobile switching center. The mobile switching center then preferably sends the request and the multimedia bookmark to a video bookmark message service center. The video bookmark message service center then preferably determines a suitable bit rate for transmitting the multimedia content to the mobile device. Based on the bit rate and various characteristics of the mobile device, the video bookmark message service center also preferably calculates a new multimedia bookmark. The new multimedia bookmark is then sent to a multimedia server, which streams the multimedia content to the video bookmark message service center before the multimedia content is delivered to the mobile device via the mobile switching center.

2. Search

The present invention further provides a new approach to utilizing user-established relevance between images. Unlike conventional content-based and text-based approaches, the method of the present invention uses only direct links between images without relying on image descriptors such as low-level image features or textual annotations. Users provide relevance information in the form of relevance feedback, and the information is accumulated in each image's queue of links and propagated through linked images in a relevance graph. The collection of direct image links can be effective for the retrieval of subjectively similar images when they are gathered from a large number of users over a considerable period of time. The present invention can be used in conjunction with other content-based and text-based image retrieval methods.

The present invention also provides a new method to quickly find, from a large database of images/frames, the objects close enough to a query image/frame under a certain distortion. With the metric property of the distance function, the information from LBG clustering, and a Haar-transform-based fast codebook search algorithm, which is also disclosed herein, the present invention reduces the number of distance evaluations at query time, thus resulting in fast retrieval of data objects from the database. Specifically, the present invention sorts and stores in advance the distances to a group of predefined distinguished points (called reference points) in the feature space and performs binary searches on the distances so as to speed up the search.

The present invention introduces an abstract multidimensional structure called a hypershell. More practically, the hypershell can be conceived as the set of all feature vectors in the feature space which lie at a distance r±ε from its corresponding reference point, where r is the distance between a query feature point and the reference point, and ε is a real number indicating the fidelity of the search results. The intersection of such hypershells leads to intersected regions which are often small partitions of the whole feature space. Therefore, instead of the whole feature space, the present invention performs the search only on the intersected regions to improve the search speed.
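A minimal sketch of the reference-point scheme just described, assuming Euclidean distance: for each reference point the distances to all database vectors are sorted in advance, a binary search keeps only vectors inside the r±ε shell of every reference point, and an exact pass ranks the survivors. The class and parameter names are illustrative; the LBG clustering and Haar-transform codebook search mentioned above are not shown.

```python
import bisect
import numpy as np

class HypershellIndex:
    """Reference-point filtering with binary search (a sketch).

    For each reference point, the distances to all database vectors are
    sorted in advance. At query time a binary search keeps only vectors
    inside the r +/- eps hypershell of every reference point (by the
    triangle inequality, every true neighbor must lie there); an exact
    pass then ranks the few survivors.
    """

    def __init__(self, vectors, references):
        self.vectors = np.asarray(vectors, dtype=np.float64)
        self.references = np.asarray(references, dtype=np.float64)
        self.shells = []  # per reference: (sorted distances, sort order)
        for ref in self.references:
            d = np.linalg.norm(self.vectors - ref, axis=1)
            order = np.argsort(d)
            self.shells.append((d[order], order))

    def query(self, q, eps):
        q = np.asarray(q, dtype=np.float64)
        candidates = None
        for (sorted_d, order), ref in zip(self.shells, self.references):
            r = float(np.linalg.norm(q - ref))
            lo = bisect.bisect_left(sorted_d, r - eps)
            hi = bisect.bisect_right(sorted_d, r + eps)
            shell = set(order[lo:hi].tolist())
            candidates = shell if candidates is None else candidates & shell
        # Exact distance check on the surviving candidates only.
        return sorted(i for i in candidates
                      if np.linalg.norm(self.vectors[i] - q) <= eps)

index = HypershellIndex(np.random.rand(1000, 16), np.random.rand(3, 16))
print(index.query(np.random.rand(16), eps=0.6))
```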

3. Editing

The present invention further provides a new approach to editing video materials, in which only the metadata of the input videos are virtually edited to create a new video, instead of actually editing the videos stored as computer files. In the present invention, the virtual editing is performed either by copying the metadata of a video segment of interest in an input metafile or by copying only the URI of the segment into a newly constructed metafile. The present invention provides a way of playing the newly edited video with only its metadata. The present invention also provides a system for the virtual editing. The present invention can be applied not only to videos stored on CD-ROM, DVD, and hard disk, but also to streaming videos over a network.

The present invention also provides a method for virtual editing of multimedia files. Specifically, one or more video files are provided. A metadata file is created for each of the video files, each of the metadata files having at least one segment to be edited. Thereafter, a single edited metafile is created that contains the segments to be edited from each of the metadata files, so that when the edited metadata file is accessed, the user is able to play the edited segments in the edited order.

The present invention also provides another method for virtual editing of multimedia files. Specifically, one or more video files are provided. A metadata file is created for each of the video files, each of the metadata files having at least one segment to be edited. Thereafter, a single edited metafile is created that contains links to the segments to be edited from each of the metadata files, so that when the edited metadata file is accessed, the user is able to play the edited segments in the edited order.

The present invention also includes a method for editing a multimedia file by providing a metafile, the metafile having at least one segment that is selectable; selecting a segment in the metafile; determining if a composing segment should be created, and if the composing segment should be created, then creating a composing segment in a hierarchical structure; specifying the composing segment as a child of a parent composing segment; determining if metadata is to be copied or if a URI is to be used; if the metadata is to be copied, then copying metadata of the selected segment to the component segment; if the URI is to be used, then writing a URI of the selected segment to the component segment; writing a URL of an input video file to the component segment; determining if all URLs of any sibling segments are the same; and if the URL is the same as all of the siblings' URLs, then writing the URL to the parent composing segment and deleting the URLs of all sibling segments.
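A minimal sketch of the two embodiments above: a component segment either copies the selected segment's metadata or stores only its URI, and a video URL shared by all siblings is hoisted to the parent composing segment. The dict-based metafile layout and field names are illustrative assumptions (an implementation might use XML, as noted below).

```python
# Sketch of building an edited metafile from selected input segments.

def add_component(edited_metafile, segment, video_url, copy_metadata=True):
    """Append a component segment to the edited metafile's hierarchy."""
    if copy_metadata:
        component = dict(segment)            # copy the segment's metadata
    else:
        component = {"uri": segment["uri"]}  # store a reference only
    component["url"] = video_url             # URL of the input video file
    edited_metafile["components"].append(component)

def hoist_common_url(edited_metafile):
    """If all sibling components share one video URL, move it to the parent."""
    urls = {c["url"] for c in edited_metafile["components"]}
    if len(urls) == 1:
        edited_metafile["url"] = urls.pop()
        for c in edited_metafile["components"]:
            del c["url"]

edited = {"components": []}
seg = {"uri": "video1.mpg#seg3", "start": 1200, "duration": 900,
       "title": "Goal highlight"}
add_component(edited, seg, "http://example.com/video1.mpg")
hoist_common_url(edited)
```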

In a further embodiment, the method for editing a multimedia file includes determining if another segment is to be selected, and if another segment is to be selected, then performing the step of selecting a segment in a metafile.

In yet a further embodiment of the method for editing a multimedia file, the method includes determining if another metafile is to be browsed, and if another metafile is to be browsed, then performing the step of providing a metafile. The metafiles may be XML files or some other format.

The present invention also provides a virtual video editor in one embodiment. The virtual video editor includes a network controller constructed and arranged to access remote metafiles and remote video files, and a file controller in operative connection to the network controller and constructed and arranged to access local metafiles and local video files, and to access the remote metafiles and the remote video files via the network controller. A parser constructed and arranged to receive information about the files from the file controller and an input buffer constructed and arranged to receive parser information from the parser are also included in the virtual video editor. Further, a structure manager constructed and arranged to provide structure data to the input buffer, a composing buffer constructed and arranged to receive input information from the input buffer and structure information from the structure manager to generate composing information, and a generator constructed and arranged to receive the composing information from the composing buffer and to generate output information in a pre-selected format are preferably included.

In a further embodiment, the virtual video editor also includes a playlist generator constructed and arranged to receive structure information from the structure manager in order to generate playlist information, and a video player constructed and arranged to receive the playlist information from the playlist generator and file information from the file controller in order to generate display information.

In yet a further embodiment, the virtual video editor also includes a display device constructed and arranged to receive the display information from the video player and to display the display information to a user.

In a further embodiment, the present invention provides a method for transcoding an image for display at multiple resolutions. Specifically, the method includes providing a multimedia file, designating one or more regions of the multimedia file as focus zones, and providing a vector to each of the focus zones. The method continues by reading the multimedia file with a client device, the client device having a maximum display resolution, and determining if the resolution of the multimedia file exceeds the maximum display resolution of the client device. If the multimedia file resolution exceeds the maximum display resolution of the display device, the method determines the maximum number of focus zones that can be displayed on the client device. Finally, the method includes displaying the maximum number of focus zones on the client device.

4. Transcoding

The present invention also provides a novel scheme for generating a transcoded (scaled and cropped) image to fit the size of the respective client display when an image is transmitted to a variety of client devices with different display sizes. The scheme has two key components: 1) a perceptual hint for each image block, and 2) an image transcoding algorithm. For a given semantically important block in an image, the perceptual hint provides information on the minimum allowable spatial resolution. In effect, it provides quantitative information on how much the spatial resolution of the image can be reduced while ensuring that the user will perceive the transcoded image as the author or publisher wanted to represent it. The image transcoding algorithm, which is basically a content adaptation process, selects the best image representation to meet the client capabilities while delivering the largest content value. The content adaptation algorithm is modeled as a resource allocation problem to maximize the content value.
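To illustrate the resource allocation view, the sketch below picks one representation per image block to maximize total content value within a client's byte budget. The block names, sizes, and values are hypothetical, and the exhaustive search stands in for whatever optimization an actual implementation would use.

```python
from itertools import product

def adapt(blocks, budget):
    """Pick one (size, value) option per block to maximize total value
    within the byte budget. Exhaustive search; fine for a few blocks."""
    best_value, best_choice = -1.0, None
    for choice in product(*blocks):
        size = sum(s for s, _ in choice)
        value = sum(v for _, v in choice)
        if size <= budget and value > best_value:
            best_value, best_choice = value, choice
    return best_choice, best_value

# Hypothetical options per block: full resolution, reduced, or dropped.
face = [(4000, 1.0), (1500, 0.6), (0, 0.0)]
text = [(2500, 0.9), (800, 0.3), (0, 0.0)]
print(adapt([face, text], budget=5000))
# -> (((1500, 0.6), (2500, 0.9)), 1.5): shrink the face, keep full text
```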

5. Visual Rhythm

One of the embodiments of the method of the present invention provides a fast and efficient approach for constructing the visual rhythm. Unlike conventional approaches, which decode all pixels composing a frame to obtain a certain group of pixel values using conventional video decoders, the present invention provides a method such that only a few of the pixels composing a frame are decoded to obtain the actual group of pixels needed for constructing the visual rhythm. Most video compression schemes adopt intraframe and interframe coding to reduce spatial as well as temporal redundancies. Therefore, once the group of pixels is determined for constructing the visual rhythm, one decodes only this group of pixels in frames which are not referenced by other frames for interframe coding. For frames referenced by other frames for interframe coding, one decodes the determined group of pixels for constructing the visual rhythm, as well as the few other pixels needed to decode this group of pixels in the frames referencing those frames. This allows fast generation of the visual rhythm for its application to shot detection, caption text detection, or any other possible applications derived from it.

Another embodiment of the method of the present invention provides an efficient and fast compressed-DCT-domain method to locate caption text regions in intra-coded and inter-coded frames through the visual rhythm, from the observation that caption text generally tends to appear in certain areas of the video or that these areas are known a priori. Secondly, the method employs a combination of contrast and temporal coherence information on the visual rhythm to detect text frames, and uses information obtained through the visual rhythm to locate caption text regions in the detected text frames along with their temporal duration within the video.

In one embodiment of the present invention, a content transcoder for modifying and forwarding multimedia content maintained in one or more multimedia content databases to a wide area network for display on a requesting client device is provided. In this embodiment, the content transcoder preferably includes a policy engine coupled to the multimedia content database and a content analyzer operably coupled to both the policy engine and the multimedia content database. The content transcoder of the present invention also preferably includes a content selection module operably coupled to both the policy engine and the content analyzer, and a content manipulation module operably coupled to the content selection module. Finally, the content transcoder preferably includes a content analysis and manipulation library operably coupled to the content analyzer, the content selection module, and the content manipulation module. In operation, the policy engine may receive a request for multimedia content from the requesting client device via the wide area network and policy information from the multimedia content database. The content analyzer may retrieve multimedia content from the multimedia content database and forward the multimedia content to the content selection module. The content selection module may select portions of the multimedia content based on the policy information and information from the content analysis and manipulation library and forward the selected portions of multimedia content to the content manipulation module. The content manipulation module may then modify the multimedia content for display on the requesting client device before transmitting the modified multimedia content over the wide area network to the requesting client device.

Features and advantages of the invention will be apparent from the following description of the embodiments, given for the purpose of disclosure and taken in conjunction with the accompanying drawings.

BRIEF DESCRIPTION OF THE DRAWINGS

A more complete understanding of the present invention and advantages thereof may be acquired by referring to the following description taken in conjunction with the accompanying drawings, wherein:

FIG. 1 is an illustration of a conventional prior art bookmark.

FIG. 2 is an illustration of a multimedia bookmark in accordance with the present invention.

FIG. 3 is an illustration of exemplary searching for multimedia content relevant to the content information saved in the multimedia bookmark of the present invention, where both positional and content information are used.

FIG. 4 is an illustration of an exemplary tree structure used by two exemplary search methods in accordance with the present invention.

FIG. 5 is an example of five variations encoded by the present invention from the same source video content.

FIG. 6 is an example of two multimedia contents and their associated metadata of the present invention.

FIG. 7 is a list of example multimedia bookmarks of the present invention.

FIG. 8 is an illustration of an exemplary method of adjusting bookmarked positions in the durable bookmark system of the present invention.

FIG. 9 is an illustration of an exemplary user interface incorporating a multimedia bookmark of the present invention.

FIG. 10 is a flowchart illustrating an exemplary embodiment of a method of the present invention that is effective to implement the disclosed processing system.

FIG. 11 is a flowchart illustrating the overall process of saving and retrieving multimedia bookmarks of the present invention.

FIG. 12 is a flowchart illustrating an exemplary process of playing a multimedia bookmark of the present invention.

FIG. 13 is a flowchart illustrating an exemplary process of deleting a multimedia bookmark of the present invention.

FIG. 14 is a flowchart illustrating an exemplary process of adding a title to a multimedia bookmark of the present invention.

FIG. 15 is a flowchart illustrating an exemplary process of the present invention for searching for the relevant multimedia content based upon content, as well as textual information if available.

FIG. 16 is a flow chart illustrating an exemplary process of the present invention for sending a bookmark to other people via e-mail.

FIG. 17 is a flowchart illustrating an exemplary method of the present invention for e-mailing a multimedia bookmark of the present invention.

FIG. 18 is a block diagram illustrating an exemplary system for transmitting multimedia content to a mobile device using the multimedia bookmark of the present invention.

FIG. 19 is a block diagram illustrating an exemplary message signal arrangement of the present invention between a personal computer and a mobile device.

FIG. 20 is a block diagram illustrating an exemplary message signal arrangement of the present invention between two mobile devices.

FIG. 21 is a block diagram illustrating an exemplary message signal arrangement of the present invention between a video server and a mobile device.

FIG. 22 is a block diagram illustrating an exemplary data correlation method of the present invention.

FIG. 23 is a block diagram illustrating an exemplary swiping technique of the present invention.

FIG. 24 is a block diagram illustrating an alternate exemplary swiping technique of the present invention.

FIG. 25 is a flowchart illustrating an exemplary peer-to-peer exchange of the multimedia bookmark of the present invention.

FIG. 26 is a block diagram illustrating different sampling strategies.

FIG. 27 is a block diagram illustrating an exemplary visual rhythm method of the present invention.

FIG. 28 is a block diagram illustrating the localization and segmentation of text information according to the present invention.

FIG. 29 is a block diagram illustrating the use of an exemplary Haar transformation according to the present invention.

FIG. 30 is a block diagram illustrating an exemplary queue for image links of the present invention.

FIG. 31 is a block diagram illustrating an alternate exemplary queue for image links of the present invention.

FIGS. 32(a) and 32(b) are block diagrams illustrating a comparison of a prior art video methodology and an exemplary editing method of the present invention.

FIG. 33 is a block diagram illustrating an exemplary segmentation and reconstruction of a new multimedia video presentation according to the method of the present invention.

FIG. 34 is a block diagram illustrating an exemplary edited multimedia file according to the present invention.

FIG. 35 is a flowchart of an exemplary method of the present invention for virtual video editing based on metadata.

FIG. 36 is an exemplary pseudocode implementation of the method of the present invention.

FIG. 37 is an exemplary pseudocode implementation of the method of the present invention.

FIG. 38 is an exemplary pseudocode implementation of the method of the present invention.

FIG. 39 is an exemplary pseudocode implementation of the method of the present invention.

FIG. 40 is an exemplary pseudocode implementation of the method of the present invention.

FIG. 41 is an exemplary pseudocode implementation of the method of the present invention.

FIG. 42 is a block diagram illustrating an exemplary virtual video editor of the present invention.

FIG. 43 is a block diagram illustrating an exemplary transcoding method of the present invention without an SRR value.

FIG. 44 is a block diagram illustrating an exemplary transcoding method of the present invention with an SRR value.

FIG. 45 is a block diagram illustrating an exemplary content transcoder of the present invention.

FIG. 46 is a block diagram illustrating an exemplary adaptive window focusing method of the present invention.

FIG. 47 is a block diagram and table illustrating image nodes and edges according to an exemplary method of the present invention.

FIG. 48 is a block diagram illustrating an exemplary hypershell search method of the present invention.

FIG. 49 is a block diagram illustrating the contents of an embodiment of the video bookmark of the present invention.

FIG. 50 is a block diagram illustrating the recommendation engine of the present invention.

FIG. 51 is a block diagram illustrating the video bookmark process of the present invention in conjunction with an EPG channel.

FIG. 52 is a block diagram illustrating the video bookmark process of the present invention in conjunction with a network.

FIG. 53 is a block diagram of the system of the present invention.

FIG. 54 is a block diagram of an exemplary relevance queue of the present invention.

FIG. 55 is a timeline diagram showing an exemplary embodiment of the rewind method of the present invention.

FIG. 56 is a timeline diagram showing an exemplary embodiment of the rewind method of the present invention.

FIG. 57 is a flowchart showing an exemplary embodiment of the retrieval method of the present invention.

FIG. 58 is a flowchart showing another exemplary embodiment of the retrieval method of the present invention.

FIG. 59 is a flowchart showing another exemplary embodiment of the retrieval method of the present invention.

FIG. 60 is a block diagram illustrating a hierarchical arrangement of images that exemplifies a navigation method of the present invention.

FIG. 61 is an illustration of a web page having an exemplary duration bar of the present invention.

FIG. 62 is an illustration of another web page having an exemplary duration bar of the present invention.

FIG. 63 is a diagram illustrating an exemplary hypershell search method of the present invention.

FIG. 64 is a diagram illustrating another exemplary hypershell search method of the present invention.

FIG. 65 is a diagram illustrating another exemplary hypershell search method of the present invention.

FIG. 66 is a diagram illustrating another exemplary hypershell search method of the present invention.

FIG. 67 is a diagram illustrating another exemplary hypershell search method of the present invention.

FIG. 68 is a block diagram illustrating an exemplary embodiment of the metadata server and metadata agent of the present invention.

FIG. 69 is a block diagram illustrating an alternate exemplary embodiment of the metadata server and metadata agent of the present invention.

FIG. 70 is a timeline comparison illustrating an exemplary offset recording capability of the present invention.

FIG. 71 is a timeline comparison illustrating an alternate exemplary offset recording capability of the present invention.

FIG. 72 is a timeline comparison illustrating an exemplary interrupt recording capability of the present invention.

FIG. 73 is a timeline comparison illustrating the exemplary disparate and sequential recording capabilities of the present invention.

While the present invention is susceptible to various modifications and alternative forms, specific exemplary embodiments thereof have been shown by way of example in the drawings and are herein described in detail. It should be understood, however, that the description herein of specific embodiments is not intended to limit the invention to the particular forms disclosed, but, on the contrary, the intention is to cover all modifications, equivalents, and alternatives falling within the spirit and scope of the invention as defined by the appended claims.

DETAILED DESCRIPTION OF SPECIFIC EMBODIMENTS

FIG. 53 illustrates the system of the present invention. At the heart of the system of the present invention is a Wide Area Network 5350, most famously exemplified by the Internet. The present invention can be contained within the server 5314, as well as a series of clients such as Laptop 5322, Video Camera 5324, Telephone 5326, Digitizing Pad 5328, Personal Digital Assistant (PDA) 5330, Television 5332, Set Top Box 5340 (which is connected to and serves Television 5338), Scanner 5334, Facsimile Machine 5336, Automobile 5302, Truck 5304, Screen 5308, Work Station 5312, Satellite Dish 5310, and Communications Tower 5306, all useful for communications to or from remote devices for use with the system of the present invention. The present invention is particularly useful for set top boxes 5340. The set top boxes 5340 may be used as intermediate video servers for home networking, serving televisions, personal computers, game stations, and other appliances. The server 5314 can be connected to an internal local area network via, for example, Ethernet 5316, although any type of communications protocol in a local area network or wide area network is possible for use with the present invention. Preferably, the local area network for the server 5314 has connections to data storage 5318, which can include database storage capability. The local area network connected to Ethernet 5316 may also hold one or more alternate servers 5320 for purposes of load balancing, performance, etc. The multimedia bookmarking scheme of the present invention can utilize the servers and clients of the system of the present invention, as illustrated in FIG. 53, for transferring data to or loading data from the servers through the Wide Area Network 5350.

In general, the present invention is useful for storing, indexing, searching, retrieving, editing, and rendering multimedia content over networks having at least one device capable of storing and/or manipulating an electronic file, and at least one device capable of playing the electronic file. The present invention provides various methodologies for tagging multimedia files to facilitate the indexing, searching, and retrieving of the tagged files. The tags themselves can be embedded in the electronic file, or stored separately in, for example, a search engine database. Other embodiments of the present invention facilitate the e-mailing of multimedia content. Still other embodiments of the present invention employ user preferences and user behavioral history that can be stored in a separate database or queue, or can also be stored in the tag related to the multimedia file, in order to further enhance the rich search capabilities of the present invention.

Other aspects of the present invention include using hypershell and other techniques to read text information embedded in multimedia files for use in indexing, particularly tag indexes. Still more methods of the present invention enable the virtual editing of multimedia files by manipulating metadata and/or tags rather than editing the multimedia files themselves. The edited file (with rearranged tags and/or metadata) can then be accessed in sequence in order to link seamlessly one or more multimedia files in the new edited arrangement.

Still other methods of the present invention enable the transcoding of images/videos so that users can display images/videos on devices that do not have the same resolution capabilities as the devices for which the images/videos were originally intended. This allows devices such as, for example, PDA 5330, laptop 5322, and automobile 5302 to retrieve useable portions of the same image/video that can be displayed on, for example, workstation 5312, screen 5308, and television 5332.

Finally, the indexing methods of the present invention are enhanced by a unique modification of visual rhythm techniques that is part of other methods of the present invention. This modification of prior art visual rhythm techniques enables the system of the present invention to capture text information in the form of captions that are embedded into multimedia information, and even from video streams as they are broadcast, so that text information about the multimedia information can be included in the multimedia bookmarks of the present invention and utilized for storing, indexing, searching, retrieving, editing, and rendering of the information.

1. Multimedia Bookmark

The methods of the present invention described in this disclosure can be implemented, for example, in software on a digital computer having a processor that is operable with system memory and a persistent storage device. However, the methods described herein may also be implemented entirely in hardware, entirely in software, or in any combination of the two.

In general, after a multimedia content is analyzed automatically and/or annotated by a human operator, the results of the analysis and annotation are saved as “metadata” with the multimedia content. The metadata usually include descriptive information about the multimedia data content, such as distinctive characteristics of the data and the structure and semantics of the content. Some of the description provides information on the whole content, such as a summary, bibliography, and media format. In general, however, most of the description is structured around “segments” that represent spatial, temporal, or spatial-temporal components of the audio-visual content. In the case of video content, the segment may be a single frame, a single shot consisting of successive frames, or a group of several successive shots. Low-level features and some elementary semantic information may describe each segment. Examples of such descriptions include color, texture, shape, motion, audio features, and annotated texts.

If it is desired to generate metadata for several variations of a multimedia content, it would be natural to generate the metadata only for a single variation, called a master file, and then have the other variations share the same metadata. This sharing of metadata would save a lot of time and effort by skipping the time-consuming and labor-intensive work of generating multiple versions of metadata. In this case, the media positions (in terms of time points or bytes) contained in the metadata obtained with respect to the master file may not be directly applied to the other variations. This is because there may be mismatches of media positions between the master and the other variations if the master and the other variations do not start at the same position of the source content.

The method and system of the present invention include a tag that can contain information about all or a portion of a multimedia file. The tag can come in several varieties, such as text information embedded into the multimedia file itself, appended to the end of the multimedia file, or stored separately from the multimedia file on the same or remote network storage device.

Alternatively, the multimedia file has embedded within it one or more global unique identifiers (GUIDs). For example, each scene in a movie can be provided with its own GUID. The GUIDs can be indexed by a search engine, and the multimedia bookmarks of the present invention can reference the GUID that is in the movie. Thus, multiple multimedia bookmarks of the present invention can reference the same GUID in a multimedia document without impacting the size of the multimedia document, or the performance of servers handling the multimedia document. Furthermore, the GUID references in the multimedia bookmarks of the present invention are themselves indexable. Thus, a search on a given multimedia document can prompt a search for all multimedia bookmarks that reference a GUID embedded within the multimedia file, providing a richer and more extensive resource for the user.
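A minimal sketch of such a GUID index (names hypothetical) makes the point concrete: bookmarks are stored against the GUIDs they reference, so any number of bookmarks can point into the same multimedia document without touching the document itself.

from collections import defaultdict

class BookmarkIndex:
    """Maps GUIDs embedded in multimedia files to the bookmarks that reference them."""

    def __init__(self):
        self._by_guid = defaultdict(list)

    def add(self, guid, bookmark):
        self._by_guid[guid].append(bookmark)

    def bookmarks_for(self, guid):
        # All bookmarks referencing this GUID; the media file itself never changes.
        return list(self._by_guid[guid])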

FIG. 2 shows a multimedia bookmark 210 of the present invention comprising positional information 212 and content information 214. The positional information 212 is used for accessing a multimedia content 204 starting from a bookmarked position 206. The content information 214 is used for visually displaying multimedia bookmarks in a bookmark list 208, as well as for searching one or more multimedia content databases for the content that matches the content information 214.

The positional information 212 may be composed of a URI, a URL, or the like, and a bookmarked position (relative time or byte position) within the content. For the purposes of this disclosure, a URI is synonymous with a position of a file and can be used interchangeably with a URL or other file location identifier. The content information 214 may be composed of audio-visual features and textual features. The audio-visual features are the information obtained, for example, by capturing or sampling the multimedia content 204 at the bookmarked position 206. The textual features are text information specified by the user(s) or delivered with the content. Other aspects of the textual features may be obtained by accessing metadata of the multimedia content.

In one embodiment of the multimedia bookmark 210 of the present invention, the positional information 212 is composed of a URI and a bookmarked position such as an elapsed time, time code, or frame number. The content information 214 is composed of audio-visual features, such as thumbnail image data of the captured video frame, and visual feature vectors such as a color histogram for one or more of the frames. The content information 214 of a multimedia bookmark 210 is also composed of such textual features as a title specified by a user or delivered with the content, and annotated text of a video segment corresponding to the bookmarked position.

In the case of an audio bookmark of the present invention, the positional information 212 is composed of a URI, a URL, or the like, and a bookmarked position such as an elapsed time. Similarly, the content information 214 is composed of audio-visual features such as the sampled audio signal (typically of short duration) and its visualized image. The content information 214 of an audio bookmark 210 is also composed of such textual features as a title, optionally specified by a user or simply delivered with the content, and annotated text of an audio segment corresponding to the bookmarked position. In the case of a text bookmark 210, the positional information 212 is composed of a URI, URL, or the like, and an offset from the starting point of a text document. The offset can be of any size, but is normally about a byte in size. The content information 214 is composed of a sampled text string present at the bookmarked position, and text information specified by the user(s) and/or delivered with the content, such as the title of the text document.
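The fields enumerated in the preceding paragraphs can be collected into a simple data model; the following Python sketch is illustrative only and is not the storage format of the invention.

from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class PositionalInfo:
    uri: str           # URI, URL, or other file location identifier
    position: float    # elapsed time, time code, frame number, or byte offset

@dataclass
class ContentInfo:
    thumbnail: Optional[bytes] = None     # captured frame or visualized audio sample
    features: List[float] = field(default_factory=list)  # e.g., a color histogram
    title: Optional[str] = None           # specified by the user or delivered with the content
    annotated_text: Optional[str] = None  # from the metadata of the bookmarked segment

@dataclass
class MultimediaBookmark:
    positional: PositionalInfo
    content: ContentInfo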

FIG. 3 shows an illustration of searching for multimedia contents that are relevant to the content information 314 (which correlates to element 214 of FIG. 2) that is stored in the multimedia bookmark 210 of FIG. 2 of the present invention, where both positional and content information are used. The content information 314 is comprised of audio-visual features 320, such as a captured frame 322 and sampled audio data 324, and textual features 326, such as annotated text 328 and a title 330. There are many cases where a bookmark system that utilizes only positional information, such as a URI and an elapsed time, as used by conventional bookmarks, may not be valid. For example, if a bookmark were generated during the preview of a multimedia content broadcast, the bookmark would not be valid for viewing a full version of the broadcast. If a bookmark were saved during a live Internet broadcast, the bookmark would not be valid for viewing an edited version of the live broadcast. Further, if a user wanted to access the bookmarked multimedia content from another site that also provides the content, even the positional information such as the URI would not be valid.

To solve the problems described in the background section, the present invention uses the content information 314 (element 214 of FIG. 2) that is saved in the multimedia bookmark to obtain the actual positional information of the last-visited segment by searching the multimedia database 310 using the content information 314 as a query input. Content information characteristics such as the captured frame 322, sampled audio data 324, annotated text of the segment corresponding to a bookmarked position 328, and the title delivered with the content 330 can be used as query input to a multimedia search engine 332. The multimedia search engine searches its multimedia database 310 by performing content-based and/or text-based multimedia searches, and finds the relevant positions of multimedia contents. The search engine then retrieves a list of relevant segments 334 with their positional information, such as the URI, URL, and the like, and the relative position. With a multimedia player 336, a user can start playing from the retrieved segments of the contents. The retrieved segments 334 are usually those segments having contents relevant or similar to the content information saved in the multimedia bookmark.

FIG. 4 illustrates an embodiment of a key frame hierarchy used by a search method of the multimedia search engine 332 (see FIG. 3) in accordance with the present invention. The method arranges key frames in a hierarchical fashion to enable fast and accurate searching of frames similar to a query image.

The key frame hierarchy illustrated in FIG. 4 is a tree-structured representation for multi-level abstraction of a video by key frames, where each node denotes a key frame. A number Df is associated with each node and represents the maximum distance between the low-level feature vector of the node 414 and those of its descendant nodes in its subtree (for example, nodes 416 and 418). An example of such a feature vector is the color histogram of a frame. If a video database composed of one or more key frame hierarchies, which correspond to different video sequences, must be searched to find a specific query image fq, the dissimilarity between fq and a subtree rooted at the key frame fm is measured by testing d(fq, fm) > Df + e, where d(fq, fm) is a distance metric measuring dissimilarity, such as the L1 norm between feature vectors, and e is a threshold value set by a user. If the condition is satisfied, searching of the subtree rooted at the node fm is skipped (i.e., the subtree is “pruned” from the search). This method of the present invention reduces the search time substantially by pruning out the unnecessary comparison steps.
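The pruning test can be stated compactly in code. The sketch below follows the description above, using the L1 norm as d and reporting every key frame within the threshold e of the query; the class layout itself is illustrative.

import numpy as np

class KeyFrameNode:
    def __init__(self, feature, d_f, children=()):
        self.feature = np.asarray(feature, dtype=float)
        self.d_f = d_f                 # max distance to any descendant feature (Df)
        self.children = list(children)

def pruned_search(node, f_q, e, results):
    """Collect key frames within e of the query feature f_q, pruning subtrees."""
    d = np.abs(f_q - node.feature).sum()   # L1 norm between feature vectors
    if d > node.d_f + e:
        return                             # no node in this subtree can match: prune
    if d <= e:
        results.append(node)               # this key frame is similar enough
    for child in node.children:
        pruned_search(child, f_q, e, results)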

Durable Multimedia Bookmark using Offset and Time Scale

FIG. 5 shows an example of five variations encoded from the same source video content 502. FIG. 5 shows two ASF format files 504, 506 with bandwidths of 28.8 and 80 kbps that start and end at exactly the same time points. FIG. 5 also shows the first RM format file 508 with a bandwidth of 80 kbps. In the RM file 508, encoding of the source content starts the time interval o₁ before the start time point of the ASF files 504, 506, and ends the time interval o₄ before the end time point of the ASF files 504 and 506. The RM file 508 thus has an extra video segment with the duration of o₁ at the beginning. Consequently, compared with the start time point of a specific video segment 514 in the ASF files, the start time point of the video segment in the RM file is temporally shifted right by the time interval o₁. The start time point of the video segment in the RM file can be computed by adding the time interval o₁ to the start time point of the video segment in the ASF files. Similarly, the second RM file 510, with a bandwidth of 28.8 kbps, does not have a leading video segment with the duration of o₂. The start time point of the video segment 514 in the second RM file can be computed by subtracting the time interval o₂ from the start time point of the video segment in the ASF files. Also, the MOV file 512, with a bandwidth of 56 kbps, has two extra segments with the durations of o₃ and o₆, respectively.

In another example, one of the different variations encoded from the same source multimedia content is designated as the master file, and the other variations as slave files. In the example illustrated in FIG. 5, the ASF file encoded at the bandwidth of 80 kbps 504 is taken to be the master file, and the other four files are slave files. In this example, the offset of a slave file is the difference in position, in time duration or byte offset, between the start position of the master file and the start position of the slave file. In this example, the differences of position o₁, o₂, and o₃ are offsets. The offset of a slave file is computed by subtracting the start position of the slave file from the start position of the master file. In this formula, the two start positions are measured with respect to the source content. Thus, the offset will have a positive value if the start position of a slave occurred before the start position of the master with reference to the source content. Conversely, the offset will have a negative value if the start position of a slave occurred after the start position of the master. For the example shown in FIG. 5, the offsets o₁ and o₃ are positive values, and o₂ is negative. Although not specifically required, by convention the offset of a master file is set to zero.

Consider the different variations encoded from the same source multimedia content. A user generates a multimedia bookmark with respect to one of the variations, which is to be called a bookmarked file. The multimedia bookmark is then used at a later time to play one of the variations, which is called a playback file. In other words, the bookmarked file pointed to by the multimedia bookmark and the playback file selected by the user may not be the same variation, but they refer to the same multimedia content.

If there is only one variation encoded from the original content, both the bookmarked and the playback files should be the same. However, if there are multiple variations, a user can store a multimedia bookmark for one variation and later play another variation by using the saved bookmark. The playback may not start at the last accessed position because there may be mismatches of positions between the bookmarked and the playback files.

Associated with a multimedia content are metadata containing the offsets of the master and slave variations of the multimedia content in the form of media profiles. Each media profile corresponds to a different variation that can be produced from a single source content depending on the values chosen for the encoding formats, bandwidths, resolutions, etc. Each media profile of a variation contains at least a URI and an offset of the variation. Each media profile of a variation optionally contains a time scale factor of the media time of the variation encoded at a different temporal data rate with respect to its master variation. The time scale factor is specified on a zero-to-one scale, where a value of one indicates the same temporal data rate, and 0.5 indicates that the temporal data rate of the variation is reduced by half with respect to the master variation.

Table 1 is an example of metadata for the five variations in FIG. 5. The metadata is written according to the ISO/IEC MPEG-7 metadata description standard, which is under development. The metadata are described in XML, since MPEG-7 adopted XML Schema as its description language. In the table, the offset values of the three variations 508, 510, 512 are assumed to be o₁=2, o₂=−3, and o₃=10 seconds, respectively. Also, the temporal data rate of the variation 512 is assumed to be reduced by half with respect to the master variation 504, and the other variations are not temporally reduced.

TABLE 1. An example of Metadata Description for Five Variations

<VariationSet>
  <Source>
    <Video>
      <MediaLocator>
        <MediaUri>http://www.server.com/sample-80.asf</MediaUri>
      </MediaLocator>
    </Video>
  </Source>
  <Variation timeOffset="PT0S" timeScale="1">
    <Source>
      <Video>
        <MediaLocator>
          <MediaUri>http://www.server.com/sample-28.asf</MediaUri>
        </MediaLocator>
      </Video>
    </Source>
    <VariationRelationship>alternativeMediaProfile</VariationRelationship>
  </Variation>
  <Variation timeOffset="PT2S" timeScale="1">
    <Source>
      <Video>
        <MediaLocator>
          <MediaUri>http://www.server.com/sample-80.rm</MediaUri>
        </MediaLocator>
      </Video>
    </Source>
    <VariationRelationship>alternativeMediaProfile</VariationRelationship>
  </Variation>
  <Variation timeOffset="-PT3S" timeScale="1">
    <Source>
      <Video>
        <MediaLocator>
          <MediaUri>http://www.server.com/sample-28.rm</MediaUri>
        </MediaLocator>
      </Video>
    </Source>
    <VariationRelationship>alternativeMediaProfile</VariationRelationship>
  </Variation>
  <Variation timeOffset="PT10S" timeScale="0.5">
    <Source>
      <Video>
        <MediaLocator>
          <MediaUri>http://www.server.com/sample-56.mov</MediaUri>
        </MediaLocator>
      </Video>
    </Source>
    <VariationRelationship>alternativeMediaProfile</VariationRelationship>
    <VariationRelationship>temporalReduction</VariationRelationship>
  </Variation>
</VariationSet>
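For illustration, the media profiles in Table 1 can be read back with the Python standard library; the duration parser below handles only the simple PT<seconds>S forms appearing in the example and is not a general ISO 8601 parser.

import xml.etree.ElementTree as ET

def parse_offset(value):
    """Convert 'PT2S' / '-PT3S' style durations to signed seconds."""
    sign = -1 if value.startswith('-') else 1
    return sign * float(value.lstrip('-')[2:-1])

def media_profiles(xml_text):
    """Return (uri, offset_seconds, time_scale) for the master and each variation."""
    root = ET.fromstring(xml_text)
    master_uri = root.find('Source/Video/MediaLocator/MediaUri').text
    profiles = [(master_uri, 0.0, 1.0)]   # by convention the master offset is zero
    for var in root.findall('Variation'):
        uri = var.find('Source/Video/MediaLocator/MediaUri').text
        profiles.append((uri, parse_offset(var.get('timeOffset')),
                         float(var.get('timeScale'))))
    return profiles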

FIG. 6 shows an example of two multimedia contents and their associated metadata. Since the first multimedia content has five variations and the second has three variations, there are five media profiles in the metadata of the first multimedia content 602, and three media profiles in the metadata of the second 604. In FIG. 6, two subscripts attached to identifiers of variations, URIs, URLs or the like, and offsets represent a specific variation of a multimedia content. For example, the third variation of the first multimedia content 610 has the associated media profile 612 in the metadata of the first multimedia content 602. The media profile 612 provides the values of a URI and an offset of the third variation of the first multimedia content 610.

When a user at the client terminal wants to make a multimedia bookmark for a multimedia content having multiple variations, the following steps are taken. First, the user selects one of several variations of the multimedia content from a list of the variations and starts to play the selected variation from the beginning. When the user makes a multimedia bookmark on the selected variation, which now becomes a bookmarked file, the bookmark system stores the following positional information along with content information in the multimedia bookmark:

a. A URI of the bookmarked file;

b. A bookmarked position within the bookmarked file; and

c. A metadata identification (ID) of the bookmarked file.

The metadata ID may be a URI, URL or the like of the metafile, or an ID of the database object containing the metadata. The user then continues or terminates playing of the variation.

FIG. 7 shows an example of a list of bookmarks 702 for the variations of the two multimedia contents in FIG. 6. The list contains the first and second bookmarks 704 and 706 for the first variation, and the third bookmark 708 for the fourth variation, of the first multimedia content. Because those three bookmarks are for the same multimedia content, they also have the same metadata ID. The list also contains the fourth and fifth bookmarks 710 and 712 for the first and third variations of the second multimedia content, respectively. Thus, these two bookmarks have the same metadata ID, referring to the second multimedia content.

When a user wants to play the multimedia content from a saved bookmark position, the following steps are taken. The user selects one of the saved multimedia bookmarks from the user's bookmark list. The user can also select a variation from the list of possible variations. The selected variation now becomes a playback file. The bookmark system then checks whether the selected bookmarked file is equal to the playback file or not. If they are not equal, the bookmark system adjusts the saved bookmarked position in order to obtain an accurate playback position on the playback file. This adjustment is performed by using the offsets saved in a metafile and a bookmarked position saved in a multimedia bookmark. Assume that P_b is the bookmarked position of a bookmarked file, and P_p is the desired position (adjusted bookmark position) of the playback file. Also, let o_b and o_p be the offsets of the bookmarked and playback files, respectively. Further, let s_b and s_p be the time scale factors of the bookmarked and playback files, respectively, and let s = s_p/s_b be a time scale ratio, which converts a media time of the bookmarked file into the corresponding media time of the playback file when that media time is multiplied by the ratio. Then, P_p can be computed using the following formula:

i) P_p = s×P_b, if o_p = s×o_b;

ii) P_p = s×P_b + (|o_p| + |s×o_b|), if o_p > 0 > s×o_b;

iii) P_p = s×P_b + |o_p − s×o_b|, if o_p > s×o_b ≥ 0 or 0 ≥ o_p > s×o_b;

iv) P_p = s×P_b − (|o_p| + |s×o_b|), if o_p < 0 < s×o_b;

v) P_p = s×P_b − |o_p − s×o_b|, if 0 ≤ o_p < s×o_b or o_p < s×o_b ≤ 0.
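The five cases translate directly into code; the following Python sketch mirrors the formula above (argument names match the symbols, and the function itself is illustrative).

def adjust_position(p_b, o_b, o_p, s_b=1.0, s_p=1.0):
    """Adjusted playback position P_p from bookmarked position P_b and offsets."""
    s = s_p / s_b          # time scale ratio
    m = s * o_b            # bookmarked-file offset converted to playback media time
    if o_p == m:                              # case i
        return s * p_b
    if o_p > 0 > m:                           # case ii
        return s * p_b + (abs(o_p) + abs(m))
    if o_p > m >= 0 or 0 >= o_p > m:          # case iii
        return s * p_b + abs(o_p - m)
    if o_p < 0 < m:                           # case iv
        return s * p_b - (abs(o_p) + abs(m))
    return s * p_b - abs(o_p - m)             # case v: 0 <= o_p < m or o_p < m <= 0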

FIG. 8 shows the five distinct cases (802, 804, 806, 808, 810) illustrating the above formula. In FIG. 8, the time scale factors of the bookmarked and playback files are assumed to be the same, thus making the time scale ratio one, that is, s=1. In the above example, one offset is assumed for each slave file. In general, however, there may be a list of offset values for each slave file for the cases where frame skipping occurs during the encoding of the slave file or part of the slave file is edited.

This durable multimedia bookmark is explained with the examples in FIGS. 6 and 7. Suppose that a user wants to play back the third variation 610 of the first multimedia content in FIG. 6 from the position stored in the second bookmark 706 in FIG. 7. The second bookmark 706 was made with reference to the first variation 606 of the first multimedia content in FIG. 6. Note that the bookmarked file 606 is not equal to the playback file 610. Using the metadata ID saved in the bookmark, the bookmark system accesses the metadata of the first multimedia content 602. From the metadata, the system reads the media profile of the first variation 608 and the third variation 612. Using the offsets saved in the two profiles and the bookmarked position saved in the multimedia bookmark, the system adjusts the bookmarked position, thus obtaining a correct playback position in the playback file.

Offset Computation

In FIG. 5, the offset of a slave file is defined as the difference between the start position of a master file and the start position of a slave file. This offset calculation requires locating a referential segment, for example, the segment A 514 in FIG. 5. After aligning the start position of the referential segment from a master file with the start position of the same referential segment from a slave file, the offset is calculated as the start time of the master file minus the start time of the slave file.

A referential segment may be any multimedia segment bounded by two different time positions. In practice, however, a segment bounded between two specific successive shot boundaries, in the case of a video, is frequently used as a referential segment. Thus, the following method may be used to determine a referential segment:

1. Locate the first two shot boundaries from the beginning of each of the master and the slave file using a technique of shot boundary detection;

2. Check whether the starting frame of the first shot detected from the master file is visually similar to the corresponding frame detected from the slave file using a content-based frame/video matching technique. Check whether the same is true for the ending frames of the shots, too; and

3. Determine the segment satisfying the conditions in steps 1 and 2 and let it be the referential segment.

The method of choosing a referential segment is not limited to the procedure mentioned above. There may be other procedures within the framework of the above method of automatic detection of a referential segment and computation of an offset based on the referential segment detected. One automated realization of steps 1 through 3 is sketched below.
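By way of example, the following Python sketch automates steps 1 through 3; both detect_shots and frames_similar are assumed to be supplied by a shot boundary detector and a content-based frame matcher, respectively, and the sign convention matches the offset definition above.

def compute_offset(master, slave, detect_shots, frames_similar):
    """Offset of the slave file, aligned on the first matching referential segment."""
    m_start, m_end = detect_shots(master)[:2]   # first two shot boundaries (step 1)
    s_start, s_end = detect_shots(slave)[:2]
    # Step 2: the bounding frames of the candidate segment must match visually.
    if frames_similar(master, m_start, slave, s_start) and \
       frames_similar(master, m_end, slave, s_end):
        # Step 3: positive when the slave starts before the master in source time.
        return s_start - m_start
    raise ValueError('no referential segment found between the two files')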

User Interface and Flow Chart

FIG. 9 shows an example of a user interface incorporating the multimedia bookmark of the present invention. The user interface 900 is composed of a playback area 912 and a bookmark list 916. Further, the playback area 912 is composed of a multimedia player 904 and a variation list 910. The multimedia player 904 provides various buttons 906 for normal VCR (Video Cassette Recorder) controls such as play, pause, stop, fast forward, and rewind. It also provides an add-bookmark control button 908 for making a multimedia bookmark. If a user selects this button while playing a multimedia content, a new multimedia bookmark having both positional and content information is saved in persistent storage. Also, in the bookmark list 916, the saved bookmark is visually displayed with its content information. For example, a spatially reduced thumbnail image corresponding to the temporal location of interest saved by a user in the case of a multimedia bookmark is presented to help the user easily recognize the previously bookmarked content of the video.

In the bookmark list 916, every bookmark has five bookmark controls just below its visually displayed content information. The left-most play-bookmark control button 918 is for playing a bookmarked multimedia content from a saved bookmarked position. The delete-bookmark control button 920 is for managing bookmarks. If this button is selected, the corresponding bookmark is deleted from the persistent storage. The add-bookmark-title control button 922 is used to input a title of the bookmark given by a user. If this button is not selected, a default title is used. The search control button 924 is used for searching a multimedia database for multimedia contents relevant to the selected content information 914, used as a multimedia query input. There are a variety of cases when this control might be selected. For example, when a user selects the play-bookmark control to play a saved bookmark, the user might find out that the multimedia content being played is not in accordance with the displayed content information due to mismatches of positional information for some reason. Further, the user might want to find multimedia contents similar to the content information of the saved bookmark. The send-bookmark control button 926 is used for sending both positional and content information saved in the corresponding bookmark to other people via e-mail. It should be noted that the positional information sent via e-mail includes either a URI or other locator, and a bookmarked position.

For durable bookmarks, the variation list 910 provides the possible variations of a multimedia content with corresponding check boxes. Before a traditional normal playback or a bookmarked playback, a user selects a variation by checking the corresponding mark. If the multimedia content does not have multiple variations, this list may not appear in the user interface.

FIG. 10 is an exemplary flow chart illustrating the overall method 1000 of saving and retrieving multimedia bookmarks with two additional functions: i) searching for other multimedia content relevant to the content pointed to by the bookmark, and ii) sending a bookmark to another person via e-mail. In the multimedia process, step 1002, if a user wants to play the multimedia content (step 1004), the multimedia player is first displayed to the user in step 1006. A check is made in step 1008 to determine if multiple variations of the multimedia content are available. If so, then two extra steps are taken: in step 1010, the variation list is presented to the user, and (optionally) a default variation is selected in step 1012. Thereafter, in step 1014, the list of multimedia bookmarks is displayed to the user by using their content information and bookmark controls. The method then waits for the user to select a control in step 1016. A check is made to determine if the user wants to change the variation, step 1018. If so, the user can select the other variation, step 1020. Thereafter, in step 1022, a check is made to determine if the user has selected one of the conventional VCR-type controls (e.g., play, pause, stop, fast forward, and rewind) or one of the bookmark-type controls (add-bookmark, play-bookmark, delete-bookmark, add-bookmark-title, search, and send-bookmark). If the user selects a conventional control button, the execution of the method jumps to the selected function 1024. Otherwise, if the user selects one of the controls related to the bookmarks (1026, 1030, 1034, 1038, 1042, and 1046), the program goes to the corresponding routine (1028, 1032, 1036, 1040, 1044, and 1048), respectively. Until a different multimedia content is selected (step 1004), the multimedia player with the variation list and the bookmark list will continue to be displayed (steps 1006, 1010 and 1014).

FIG. 11 is a flow chart illustrating the process of adding a multimedia bookmark. When the add-bookmark control is selected (step 1026 of FIG. 10), execution of the method proceeds to step 1028 of FIG. 11. In this portion 1100 of the method of the present invention, the multimedia playback is suspended in step 1102. Then, the URI, URL or similar address is obtained in step 1104. A check is made in step 1106 to determine if the information on the bookmarked position, such as the time code, is available at the currently suspended multimedia content. If so, execution is moved to step 1108, where the bookmarked position is obtained. In step 1110, the bookmarked position data, if available, are used to capture, sample or derive audio-visual features of the suspended multimedia content at the bookmarked position. In step 1112, a check is made to determine if the metadata exists. If not, then execution jumps to step 1124, where the URI (or the like), the bookmarked position, and the audio-visual features are stored in persistent storage. Otherwise (i.e., the metadata of the suspended multimedia content exist), a search is conducted to find a segment corresponding to the bookmarked position in the metadata in step 1114. Next, a check is made to determine if the annotated text is available for the segment. If so, then the annotated text is obtained in step 1118. If not, step 1118 is skipped and execution resumes at step 1120, where a check is made to determine if there are media profiles that contain offset values of the suspended multimedia content. If so, step 1122 is performed, where a metadata ID is obtained in order to adjust the bookmarked position in future playback. Otherwise, step 1122 is skipped and the method proceeds directly to step 1124, where the annotated text and the metadata ID are also stored in persistent storage. Then, in step 1126, the list of multimedia bookmarks is redisplayed with their content information and bookmark controls. The multimedia playback is resumed in step 1128, and execution of the method is moved to a clearing-off routine 1610 (of FIG. 16) that is performed at the end of every bookmark control routine.

In the clearing-off routine 1610, illustrated in FIG. 16, a check is made in step 1612 to determine if the user wants to play back different multimedia content. If so, the method returns to step 1002 (see FIG. 10), where another multimedia process begins. Otherwise, the method resumes at step 1016 of FIG. 10, where the multimedia process waits for the user to select one of the conventional VCR or bookmark controls.

FIG. 12 is a flow chart illustrating the process of playing a multimedia bookmark. When the play-bookmark control is selected by the user in step 1030 (see FIG. 10), step 1032 is invoked. In step 1202 (see FIG. 12), the URI or the like, bookmarked position, and metadata ID for the multimedia content to be played back are read from persistent storage. A check is made in step 1204 to determine if the URI of the content is valid. If not, execution of the method is shifted to step 1044 (see FIG. 10), where the process of the content-based and/or text-based search begins. The URI of the content becomes invalid when the multimedia content is moved to another location, for example. If the URI of the content is valid (the result of step 1204 is positive), a check is made in step 1206 to determine if the bookmarked position is available. If not, a check is made to determine if the user desires to select the content-based and/or text-based search in step 1208. If so, execution is moved to step 1044 (see FIG. 10). Otherwise, the method moves to step 1210, where the user can simply play the multimedia content from the beginning. If the URI of the content is valid and the bookmarked position is available (i.e., the results of both steps 1204 and 1206 are positive), a check is made in step 1212 to determine if the metadata ID is available. If it is not available, the multimedia playback starts from the bookmarked position in step 1222. Otherwise, the bookmarked and playback files are identified in step 1214 and the values of their respective offsets are read from the metadata in step 1216. Then, in step 1218, the bookmarked position is adjusted by using the offsets. The multimedia playback starts from the adjusted bookmarked position in step 1220. After starting one of the playbacks (1210, 1220, or 1222), the method executes the clearing-off routine in step 1610 of FIG. 16.

FIG. 13 is a flow chart illustrating the process of deleting a multimedia bookmark. When the delete-bookmark control is selected (step 1034 of FIG. 10), the method invokes the routine illustrated in FIG. 13. In this particular portion 1300 of the method of the present invention, all positional and content information of the selected multimedia bookmark is deleted from the persistent storage in step 1302. Then, the list of multimedia bookmarks is redisplayed with their content information and bookmark controls in step 1304, and execution is then shifted to the clearing-off routine, step 1610 of FIG. 16.

FIG. 14 is a flow chart illustrating the process of adding a title to a multimedia bookmark. When the add-bookmark-title control is selected (step 1038 of FIG. 10), the program goes through this portion 1400 of the method of the present invention. In this routine, the user is prompted to enter a title in step 1402 for the saved multimedia bookmark. A check is made to determine if the user entered a title in step 1404. If not, the program may provide a default title in step 1406 that may be generated in accordance with a predetermined routine. In any case, execution proceeds to step 1408, where the list of multimedia bookmarks is redisplayed with their content information, including the titles, and bookmark controls. Thereafter, the method executes the clearing-off routine of step 1610 of FIG. 16.

FIG. 15 is a flow chart illustrating the portion 1500 of the present invention for searching for relevant multimedia content based on audio-visual features, as well as textual features saved in a multimedia bookmark, if available. The search methods currently available can be largely categorized into two types: content-based search and text-based search. Most of the prior art search engines utilize a text-based information retrieval technique. The present invention also employs content-based multimedia search engines which use, for example, retrieval techniques based on such visual and audio characteristics or features as the color histogram and audio spectrum. The content information of a particular segment, stored in a multimedia bookmark, may be used to find other relevant information about the particular segment. For example, a frame-based video search may be employed to find other video segments similar to the particular video segment.

Alternatively, a text-based search may be combined with a frame-based video search to improve the search result. Most frame-based video search methods are based on comparing low-level features such as colors and texture. These methods lack the semantics necessary for recognition of high-level features. This limitation may be overcome by combining a text-based search. Most available multimedia contents are annotated with text. For example, video segments showing President Clinton may be annotated with “Clinton.” In that case, a combined search using the image of Clinton wearing a red shirt as a bookmark may find other video segments containing Clinton, such as a segment showing Clinton wearing a blue shirt.

When the user selects as a query input a particular bookmark or partial segment of the multimedia content, such as a thumbnail image in the case of a video search, the search routine (1044 of FIG. 15) is invoked in the following three scenarios:

i. The user selects the search control (step 1042 of FIG. 10) in order to retrieve the multimedia content relevant to the query;

ii. The URI of the bookmarked multimedia content is not valid (the result of step 1204 of FIG. 12 is negative); and

iii. The URI of the bookmarked multimedia content is valid, but the bookmarked position is not available (the result of step 1206 of FIG. 12 is negative and the result of step 1208 is positive).

Once invoked, this portion 1500 reads the content information of the multimedia bookmark, such as the audio-visual and textual features of the query input, and the positional information, if available, from persistent storage in step 1502. Examples of visual features for the multimedia bookmark include, but are not limited to, captured frames in the JPEG image compression format or color histograms of the frames.

In step 1504, a check is made to determine if the annotated texts are available. If so, the annotated text is retrieved directly from the content information of the bookmark in step 1506 and execution proceeds immediately to step 1516, where the process of the text-based multimedia search is performed by using the annotated texts as query input, resulting in the multimedia segments having texts relevant to the query. If the result of step 1504 is negative, the annotated texts can also be obtained by accessing the metadata, using the positional information. Thus, a check is made in step 1508 to determine if the positional information is available. If so, then another check is made to determine if the metadata exist in step 1510. If so (i.e., the result of step 1510 is positive), step 1512 is executed, where a segment corresponding to the bookmarked position in the metadata is found. A check is then made to determine if some annotated texts for the segment are available in step 1514. If so (i.e., the result of step 1514 is positive), the text-based multimedia search is also performed in step 1516. If the annotated texts or the positional information is not available from the content information of the bookmark (i.e., the result of step 1514 is negative) or from the metadata (i.e., the result of step 1510 is negative), then a content-based multimedia search is performed by using the audio-visual features of the bookmark as query input in step 1518. The result of step 1518 is that the resulting multimedia segments have audio-visual features similar to the query. It should be noted that both the text-based multimedia search (step 1516) and the content-based multimedia search (step 1518) can be performed in sequence, thus combining their results. Alternatively, one search can be performed based on the results of the other search, although this is not presented in the flow chart of FIG. 15.

The audio-visual features of the retrieved segments at their retrieved positions are computed in step 1520 and temporarily stored to visually show the search results in step 1522, as well as to be used as query input to another search if desired by the user in steps 1530, 1532, and 1534. If the user wants to play back one of the retrieved segments (i.e., the result of step 1524 is positive), the user selects a retrieved segment in step 1526, and plays back the segment from the beginning of the segment in step 1528. The beginning of the retrieved segment that was selected is called the retrieved position in either step 1528 or step 1508. If the user wants another search (i.e., the result of step 1530 is positive), the user selects one of the retrieved segments in step 1532. Then, the content information, including the audio-visual features and annotated texts for the selected segment, is obtained by accessing the temporarily stored audio-visual features and/or the corresponding metadata in step 1534, and the new search process begins at step 1504. If the user wants no more playbacks and searches, execution is transferred to the clearing-off routine, step 1610 of FIG. 16.

Depending on the kind of information available in the multimedia bookmark, there can be a handful of client-server-based search scenarios; the multimedia bookmarks of the present invention provide an excellent example. With the combinations of multimedia bookmark information tabulated in Table 2, some examples of the client-server-based search scenario are described. Note that even though a text-based search is used in the description of the present invention, the user does not type in keywords to describe the video that the user seeks. Moreover, the user might be unaware of doing a text-based search. The present invention is designed to hide this cumbersome process of keyword typing from the user.

TABLE 2. Search types with available bookmark information

Search Type   Captured Image   Positional Info.   Annotated Text
A             √
B                              √
C                                                 √
D             √                √
E             √                                   √
F                              √                  √
G             √                √                  √

Search Type A: The multimedia bookmark has only image information.

1. When a user at the client side selects a bookmarked image, the client sends the image data to the server as a query frame.

2. The server finds the segment containing the query frame using a frame-based video search.

3. The server checks if the segment has annotated text. If so, go to step 4. Otherwise, provide the user with the result of the frame-based video search and terminate.

4. The server performs a text-based video search using the annotated text as keywords.

5. Provide the user with the combined results of the frame-based search in step 2 and the text-based search in step 4.

Search Type B: The multimedia bookmark has only positional information.

1. When a user at the client side selects a multimedia bookmark, the client sends the positional information about the image to the server.

2. The server performs a frame-based video search, using as a query frame the frame corresponding to the specified position.

3. The server checks if the segment at the specified position has annotated text. If so, go to step 4. Otherwise, provide the user with the result of the frame-based video search and terminate.

4. The server performs a text-based video search using the annotated text as keywords.

5. Provide the user with the combined results of steps 2 and 4.

Search Type C: The multimedia bookmark has only annotated text. When a user at the client side selects a multimedia bookmark, the client sends the annotated text to the server.

1. The server performs a text-based video search using the annotated text as keywords.

2. Provide the user with the result of step 1.

Search Type D: The multimedia bookmark has both image and positionalinformation. This type of search can be implemented in the way of eitherSearch Type A or B.

Search Type E: The multimedia bookmark has both image and annotated text.

1. When a user at a client side selects a bookmark image, the client sends the image data and the annotated text to the server.

2. The server performs a frame-based video search using the image as a query image.

3. The server performs a text-based video search using the annotated texts as search keywords. Note that the execution order of steps 2 and 3 can be switched.

4. Provide the user with the combined results of steps 2 and 3.

Search Type F: The multimedia bookmark has both positional information and annotated text.

1. When a user at a client side selects a multimedia bookmark, the client sends the positional information and the annotated texts to the server.
2. The server performs a frame-based video search, using the frame corresponding to the specified position as a query frame.
3. The server performs a text-based video search using the annotated texts as search keywords. Note that the execution order of steps 2 and 3 can be switched.
4. Provide the user with the combined results of steps 2 and 3.

Search Type G: The multimedia bookmark has all the information: image, position, and annotated text. This type of search can be implemented in the manner of either Search Type E or F.
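
For illustration, the client-side dispatch over these search types can be sketched as follows. This is a minimal sketch in Python: the bookmark field names and the server calls (frame_search, frame_at, text_search) are hypothetical stand-ins for the frame-based and text-based search services described above, not an interface defined by the specification.

```python
# Illustrative dispatch over Search Types A-G of Table 2.
# The server object and its methods are assumed placeholders.

def search_with_bookmark(bookmark: dict, server) -> list:
    """Combine frame-based and text-based search results according to
    which fields the multimedia bookmark carries (see Table 2)."""
    results = []
    if bookmark.get("image") is not None:            # Types A, D, E, G
        results += server.frame_search(query_frame=bookmark["image"])
    elif bookmark.get("position") is not None:       # Types B, F
        # Fetch the frame at the bookmarked position to use as a query.
        frame = server.frame_at(bookmark["uri"], bookmark["position"])
        results += server.frame_search(query_frame=frame)
    if bookmark.get("annotation"):                   # Types C, E, F, G
        results += server.text_search(keywords=bookmark["annotation"])
    return results
```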

FIG. 16 is a flow chart illustrating the method of sending a bookmark to other people via e-mail. When the send-bookmark control is selected (step 1046 of FIG. 10), step 1048 of FIG. 16 is invoked. According to the method of FIG. 16, all saved bookmark information for the selected multimedia bookmark to be sent, including the URI, the bookmarked position and metadata ID, and the audio-visual and textual features, is read from persistent storage in step 1602. Then, in step 1604, the user is prompted to enter the input needed to send an e-mail to another individual or to a group of people. If all of the necessary information is input by the user in step 1606, the e-mail is sent to the designated persons with the bookmark information in step 1608. At this point, the method enters the clearing-off routine, step 1610, which may be entered from several other portions of the method shown in FIGS. 11, 12, 13, 14, and 15. As shown in FIG. 16, a check is made in step 1612 to determine if other multimedia contents are available. If so, execution of the method is transferred to step 1002 of FIG. 10. Otherwise, execution of the method is transferred to step 1016 of FIG. 10.

The multimedia bookmark may consist of the following bookmarked information:

1. URI of a bookmarked file;
2. Bookmarked position;
3. Content information, such as an image captured at the bookmarked position;
4. Textual annotations attached to a segment which contains the bookmarked position;
5. Title of the bookmark;
6. Metadata identification (ID) of the bookmarked file;
7. URI of an opener web page from which the bookmarked file started to play; and
8. Bookmarked date.

The bookmarked information thus includes not only positional information (items 1 and 2) and content information (items 3, 4, 5, and 6) but also other useful information, such as the opener web page and the bookmarked date.
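
One plausible concrete container for these eight fields is sketched below in Python; the class name, field types, and defaults are illustrative assumptions, not a data layout prescribed by the specification.

```python
# Illustrative container for the eight items of bookmarked information.
from dataclasses import dataclass, field
from datetime import date


@dataclass
class MultimediaBookmark:
    uri: str                      # 1. URI of the bookmarked file
    position: float               # 2. bookmarked position (e.g., seconds)
    captured_image: bytes         # 3. image captured at the position
    annotations: list = field(default_factory=list)   # 4. segment text
    title: str = ""               # 5. title of the bookmark
    metadata_id: str = ""         # 6. metadata ID of the bookmarked file
    opener_uri: str = ""          # 7. web page the file started to play from
    bookmarked_date: date = field(default_factory=date.today)  # 8. date
```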

The content information can be obtained at the client or the server side while the corresponding multimedia content is being played in a networked environment. In the case of a multimedia bookmark, for example, the image captured at a bookmarked position (item 3) can be obtained from the user's video player or from a video file stored at a server. The title of a bookmark (item 5) might be obtained at the client side if the user types in his own title. Otherwise, a default title, such as the title of the bookmarked file stored at a server, can be used as the title of the bookmark. The textual annotations attached to a segment which contains the bookmarked position are stored in metadata, in which offsets and time scales of variations also exist for the durable bookmark. Thus, the textual annotations (item 4) and metadata ID (item 6) are obtained at a server.

The bookmarked information can be stored at the client's or the server's storage regardless of where the bookmarked information is obtained. The user can send the bookmarked information to others via e-mail. When the bookmarked information is stored at a server, sending it via e-mail is simple: just a link to the bookmarked information stored at the server needs to be sent. But when the bookmarked information is stored at a user's storage, the user has to send all of the information to the other person via e-mail. The delivered bookmarked information can then be stored at the receiver's storage, and the bookmarked multimedia content starts to play exactly from the bookmarked position. The bookmarked multimedia content can also be replayed at any time the receiver wants.

Some of the content information, such as a captured image, is itself multimedia data, while all the other information, including the positional information, is textual data. Both forms of the bookmarked information stored at a user's storage are sent to the other person within a single e-mail. There are two possible methods of sending the information from one user to another via e-mail:

1. Using watermarking technology: All textual information can be encoded into the content information. In the case of a multimedia bookmark, all textual information, such as the URL of a video file and a bookmarked position expressed as a time code, can be encoded into a thumbnail image captured at the bookmarked position. With watermarking technology, the image encoded with the texts can be visually almost identical to the original image. The image encoded with the texts can be attached to any e-mail message. The image delivered with the message can then be decoded, and the separated image and texts saved at the receiver's storage.
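
As one illustration of hiding text inside an image, the following is a minimal sketch of least-significant-bit (LSB) embedding in Python using the Pillow library. The length-prefixed payload layout, the red-channel choice, and the file names are assumptions for the example; a production watermark would use a more robust, imperceptible scheme than raw LSB replacement.

```python
# Minimal LSB-embedding sketch (illustrative only, not the watermarking
# method of the specification).
from PIL import Image  # pip install Pillow


def embed_text(image_path: str, payload: str, out_path: str) -> None:
    """Hide `payload` in the least-significant bits of the red channel."""
    img = Image.open(image_path).convert("RGB")
    pixels = img.load()
    data = payload.encode("utf-8")
    # Length-prefix the payload so a decoder knows where to stop.
    bits = []
    for byte in len(data).to_bytes(4, "big") + data:
        bits.extend((byte >> i) & 1 for i in range(7, -1, -1))
    w, h = img.size
    if len(bits) > w * h:
        raise ValueError("payload too large for this image")
    for idx, bit in enumerate(bits):
        x, y = idx % w, idx // w
        r, g, b = pixels[x, y]
        pixels[x, y] = ((r & ~1) | bit, g, b)   # overwrite the red LSB
    img.save(out_path, "PNG")   # a lossless format preserves the LSBs


# Hypothetical bookmark fields encoded into the thumbnail:
embed_text("thumbnail.png",
           "mms://www.server.com/sample.mpg;00:07:15:22",
           "thumbnail_marked.png")
```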

2. Using a HyperText Markup Language (HTML) document: An HTML document can be sent via e-mail, and all textual parts of the bookmarked information can be directly included in it. The captured image of a multimedia bookmark, however, cannot be directly included as-is, because the image is represented in a binary file format. Sending the binary image within an HTML document is possible by converting the binary image into a text string with an encoder such as Base-16 or Base-64 and including the string directly in the HTML document as a normal character string. The converted image is called inline media, by which one can locate any multimedia file in an HTML document. When the HTML document is sent to another user, the included text image is decoded into a binary image, which is then saved at the receiving user's storage and displayed on the screen. The receiving user may not view the detailed information, but can play the multimedia content from the bookmarked position. Table 3 is a sample HTML document which includes both the captured content image and the rest of the textual bookmarked information.

TABLE 3. An example of an HTML document holding bookmarked information

  <Html>
    <Body>
      <Object id="IMDisplay" codebase="http://www.server.com/BookmarkViewer"
          classid="CLSID:FFD1F137-722C-46B7" VIEWASTEXT>
        <Param name="BookmarkedFile" value="mms://www.server.com/sample.mpg">
        <Param name="BookmarkedPosition" value="435.78705499999995">
        <Param name="OpenerURL" value="http://www.server.com/sample.html">
        <Param name="BookmarkTitle" value="Sample Title">
        <Param name="BookmarkDate" value="July 24">
        <!-- Inline media: character-coded binary image -->
        <Param name="CapturedImage"
            value="/9j/4AAQSkZJRgABAQAAAQABAAD/2wBDAAMCAgMCAgMDAwMEAwMEBQgFBQQEBQoHBwYIDAoMDAsKCwsNDhIQDQ4RDgsLEBYQERMUFRUVDA8XGBYUGBIUFRT/2wBDAQMEBAUEBQkFBQkUDQsNFBQUFBQUFBQUFBQUFBQUFBQUF/. . . .xXluhEakJ9+7Db8blCELwzAvsfiP4htpVE9yHtY12pawxwoI0MqyFUwhCrjeoUAAB8AYGD4lRR7Fdyrva59E6f+0F4s0HV7bXNHvDp2twwJb29zb29vGsK7JUkEapEMKyugKsSD5eW3fKEx/GfxZ8WfEOx0W28SarJq0GkI0NgJ4o1aGNipZA4UMV+UEAk4xk58OooVL10uFz//2Q==">
      </Object>
    </Body>
  </Html>
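
For illustration, generating such an HTML document can be sketched in Python with the standard base64 module. The file names, the parameter layout, and the function name mirror the hypothetical Table 3 example rather than a required format.

```python
# Minimal sketch: embed a captured thumbnail as Base-64 inline media
# in an HTML e-mail body (file names and fields are illustrative).
import base64


def bookmark_to_html(image_path: str, file_uri: str, position: float,
                     title: str) -> str:
    with open(image_path, "rb") as f:
        inline = base64.b64encode(f.read()).decode("ascii")
    return f"""<Html><Body>
  <Object id="IMDisplay">
    <Param name="BookmarkedFile" value="{file_uri}">
    <Param name="BookmarkedPosition" value="{position}">
    <Param name="BookmarkTitle" value="{title}">
    <!-- Inline media: character-coded binary image -->
    <Param name="CapturedImage" value="{inline}">
  </Object>
</Body></Html>"""


html = bookmark_to_html("thumbnail.jpg",
                        "mms://www.server.com/sample.mpg",
                        435.787055, "Sample Title")
```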

FIG. 17 is an exemplary flow chart illustrating the process of saving a multimedia bookmark at a receiving user's local storage. When a user invokes his e-mail program in step 1704, the user selects a message to read in step 1706. A check is made in step 1708 to determine if the message includes a multimedia bookmark. If not, execution is moved to step 1706, where the user selects another message to read. Otherwise, another check is made in step 1710 to determine if the user wants to play the multimedia bookmark by selecting a play control button, which appears within the message. If not, execution is again moved to step 1706, where the user selects another message to read. Otherwise, in step 1712, a multimedia bookmark program having a user interface such as that illustrated in FIG. 9 is invoked. In step 1714, the delivered bookmark information included in the message is saved at the user's persistent storage, thus adding the delivered multimedia bookmark to the user's list of local multimedia bookmarks. Then, in step 1716, content information of the saved multimedia bookmark can appear in the multimedia bookmark program. Next, the play-bookmark control is internally selected in step 1718. Execution is then moved to step 1032 of FIG. 12.

Sending Messages to Mobile Devices

Short Message Service (SMS) is a wireless service enabling the transmission of short alphanumeric messages to and from mobile phones, facsimile machines, and/or IP addresses. The method of the present invention, which provides for sending a multimedia bookmark of the present invention between an IP address and a mobile phone, and also between mobile phones, is based on the SMS architecture and technologies.

FIG. 18 illustrates the basic elements of this embodiment of the present invention. Specifically, the video server (VS) 1804 of the server network 1802 is responsible for streaming video over wired or wireless networks. The server network 1802 also has the video database 1806, which is operably connected to the video server 1804.

The multimedia bookmark message service center (VMSC) 1818 acts as a store-and-forward system that delivers a multimedia bookmark of the present invention over mobile networks. The multimedia bookmark sent by a user PC 1810, either stand-alone or part of a local area network 1808, is stored in the VMSC 1818, which then forwards it to the destination mobile phone 1828 when the mobile phone 1828 is available for receiving messages.

The gateway mobile switching center (GWMSC) 1820 is a mobile network's point of contact with other networks. It receives a short message, such as a multimedia bookmark, from the VMSC, requests routing information from the HLR, and forwards the message to the MSC nearest the recipient mobile phone.

The home location register (HLR) 1822 is the main database in the mobile network. The HLR 1822 retains information about subscriptions and service profiles, as well as routing information. Upon request by the GWMSC 1820, the HLR 1822 provides the routing information for the recipient mobile phone 1828 or personal digital assistant (PDA) 1830. The mobile phone 1828 is typically a mobile handset. The PDA 1830 includes, but is not limited to, small handheld devices such as a Blackberry, manufactured by Research in Motion (RIM) of Canada.

The mobile switching center (MSC) 1824 switches connections between mobile stations or between mobile stations and other telephone and data networks (not shown).

Sending a Multimedia Bookmark to a Mobile Phone from a PC

FIG. 19 illustrates the method of the present invention for sending a multimedia bookmark from a personal computer to a mobile telephone over a mobile network. In step 1 of FIG. 19, the personal computer submits a multimedia bookmark to the VMSC 1918. Next, in step 2, the VMSC 1918 returns an acknowledgement to the PC 1910, indicating the reception of the multimedia bookmark. In step 3, the VMSC 1918 sends a request to the HLR 1922 to look up the routing information for the recipient mobile phone. Then the HLR 1922 sends the routing information back to the VMSC 1918 in step 4. In step 5, the VMSC 1918 invokes the operation to send the multimedia bookmark to the MSC 1924. Then, in step 6, the MSC delivers the multimedia bookmark to the mobile phone 1928. In step 7, the mobile phone 1928 returns an acknowledgement to the MSC 1924. Then, in step 8, the MSC 1924 notifies the VMSC 1918 of the outcome of the operation invoked in step 5. Incidentally, the method described above is equally applicable to personal digital assistants that are connected to mobile networks.

Sending a Multimedia Bookmark to a Mobile Phone from Another Mobile Phone

FIG. 20 illustrates an alternate embodiment of the present invention that enables the transmission of a multimedia bookmark from one mobile device to another. Referring to FIG. 20, the method begins at step 1, where the mobile phone 2028 submits a request to the MSC 2024 to send a multimedia bookmark to another mobile telephone customer. In step 2, the MSC 2024 sends the multimedia bookmark to the VMSC 2018. Thereafter, in step 3, the VMSC 2018 returns an acknowledgement to the MSC 2024. In step 4, the MSC 2024 returns to the sending mobile phone 2028 an acknowledgement indicating the acceptance of the request. In step 5, the VMSC 2018 queries the HLR 2022 for the location of the recipient mobile phone 2030. It should be noted that neither the sender nor the recipient need be a mobile telephone; the sending and/or receiving device could be any device that can send or receive a signal on a mobile network. In step 6 of FIG. 20, the HLR 2022 returns the identity of the destination MSC 2024 that is close to the recipient device 2030. Then the VMSC 2018 delivers the multimedia bookmark to the MSC 2024 in step 7. Then, in step 8, the MSC 2024 delivers the multimedia bookmark to the recipient mobile device 2030. In step 9, the mobile device 2030 returns an acknowledgement to the MSC 2024 for the acceptance of the multimedia bookmark. Finally, in step 10, the MSC 2024 returns to the VMSC 2018 the outcome of the request to send the multimedia bookmark.

Playing Video on a Mobile Handset or other Mobile Device

FIG. 21 illustrates an alternate embodiment of the present invention for playing video sequences on a mobile device. Specifically, the method begins generally at step 1, where the mobile device 2128 submits a request to the MSC 2124 to play the video associated with the multimedia bookmark. In step 2, the MSC 2124 sends the request with the multimedia bookmark to the VMSC 2118. It is often the case that the video pointed to by the multimedia bookmark cannot be streamed directly to the mobile device 2128. For example, if the marked video is in a high-bit-rate format, the high-bit-rate video data might not be delivered properly due to the limited bandwidth available. Further, the video might not be properly decoded on the mobile device 2128 due to the limited computing resources on the mobile device. In that case, it is desirable to deliver a low-bit-rate version of the same video content to the mobile device 2128. However, a problem occurs when the position specified by the multimedia bookmark does not point to the same content in the low-bit-rate video. To solve the problem, prior to relaying the request to the VS 2104, the VMSC 2118 decides which bit-rate video is the most suitable for the current mobile device 2128. The VMSC 2118 also calculates the new marked location to compensate for the offset value due to the different encoding format or different frame rate needed to display the video on the mobile device 2128. After completing this internal decision and computation, in step 3, the VMSC 2118 sends the modified multimedia bookmark to the video server 2104, using the server IP address designated in the multimedia bookmark. Thereafter, in step 4, the video server 2104 starts to stream the video data down to the VMSC 2118. Subsequently, in step 5, the VMSC 2118 passes the video data to the MSC 2124. Then, in step 6, the MSC 2124 delivers the video data to the service requester, mobile device 2128. Steps 4 through 6 are repeated until the mobile device 2128 issues a termination request.
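
The marked-location recomputation described above can be illustrated with a small calculation. The following Python sketch assumes a simple proportional mapping between frame rates and an optional start offset; the specification does not prescribe this exact formula, and the frame-rate values are assumptions for the example.

```python
# Illustrative remapping of a bookmarked position between two renditions
# of the same content (assumed frame rates; not the patented formula).

def remap_position(frame_number: int, src_fps: float, dst_fps: float,
                   dst_start_offset_s: float = 0.0) -> int:
    """Convert a frame number in the source encoding to the nearest
    frame in the destination encoding, allowing for a start offset."""
    time_s = frame_number / src_fps             # position in seconds
    return round((time_s - dst_start_offset_s) * dst_fps)


# A bookmark at frame 4500 of a 30 fps stream maps to frame 2250
# of a 15 fps low-bit-rate version that starts at the same instant.
print(remap_position(4500, src_fps=30.0, dst_fps=15.0))   # -> 2250
```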

User History

The metadata associated with a multimedia bookmark include positional information and content information. The positional information can be a time code or a byte offset denoting the marked time point of the video stream. The content information consists of textual information (features) and audio-visual information. There are two types of textual information, depending upon its source: i) a bookmark user and ii) a bookmark server. When a user makes a multimedia bookmark at a specific position of the video stream (or, generally, a multimedia file), i) the user can input the text annotation and other metadata that the user would like to associate with the bookmark, and/or ii) the multimedia bookmark system (server) delivers and associates the corresponding metadata with the bookmark. An example of metadata from the server is the textual annotation describing the semantic information at the bookmarked position of the video stream.

Semantic annotation, description, or indexing is often performed by humans, since it is usually difficult to automatically generate semantic metadata using current state-of-the-art video processing technologies. The problem, however, is that the manual annotation process is time-consuming and, further, different people, even specialists, can describe the same video frames/segment differently.

The present invention discloses an approach that solves the above problem by making use of bookmark users' annotations. It enables video metadata to be gradually populated with information from users over time. That is, the textual metadata for each video frame/segment are improved using a large number of users' textual annotations.

The idea behind the invention is as follows. When a user makes a multimedia bookmark at a specific position, the user is asked to enter a textual annotation. If the user is willing to annotate for his/her own later use, the user will describe the bookmark using his/her own words. This textual annotation is delivered to the server. The server collects and analyzes all the information from users for each video stream. Then, the analyzed metadata, which essentially represent the common view/description among a large number of users, are attached to the corresponding position of the video stream.

For each video stream, there is a queue of size N, called a “relevance queue,” that keeps the textual annotations with the corresponding bookmarked positions, as shown in FIG. 54. Specifically, FIG. 54 shows a relevance queue 5402 having an enqueue 5404 and a dequeue 5406 with one or more intermediate elements 5408.

The queue of FIG. 54 is initially empty. When a user makes a multimedia bookmark at a specific position of a video stream (generally, a multimedia file), the user inputs the text annotation that the user would like to associate with the bookmark. The text annotation is delivered to the server and is enqueued. For example, assume the first element of the queue 5404 for the golf video stream V_(a) is “Tiger Woods; 01:21:13:29.” A second user subsequently marks a new element at 01:21:17:00 (in hours:minutes:seconds:frames) of the golf video stream V_(a) (the same video stream as before) and enters the keyword “Tee Shot.” Then, the first element is shifted to the second position, and the new input is entered into the relevance queue 5402 for the video stream V_(a) at the enqueue 5404. This queue operation continues indefinitely.

The video indexing server 5410 periodically analyzes each queue. Suppose, for instance, that the video stream is segmented into a finite number of time intervals using the automatic shot boundary detection method. The indexing server 5410 groups the elements inside the queue by checking time codes, so that the time codes for each group fall within the time interval corresponding to each segment. For each group, the frequency of each keyword is computed, and the highly frequent keywords are taken as new semantic text annotations for the corresponding segment. In this way, the semantic textual metadata for each segment can be generated by utilizing a large number of users.
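
A minimal sketch of this aggregation step follows; the queue layout, the segment boundaries, and the majority-keyword threshold are illustrative assumptions rather than the exact indexing algorithm of the specification.

```python
# Illustrative relevance-queue analysis: group bookmark annotations by
# shot segment and promote frequently repeated keywords to metadata.
from bisect import bisect_right
from collections import Counter

# (keyword, position_in_seconds) pairs enqueued by bookmarking users.
relevance_queue = [("Tiger Woods", 4873.9), ("Tee Shot", 4877.0),
                   ("Tiger Woods", 4875.5), ("Putt", 5102.3)]

# Shot-boundary times (seconds) from an automatic shot-detection pass.
segment_starts = [0.0, 4800.0, 5000.0, 5400.0]


def annotate_segments(queue, boundaries, min_count=2):
    groups = {}
    for keyword, pos in queue:
        seg = bisect_right(boundaries, pos) - 1    # segment index for pos
        groups.setdefault(seg, Counter())[keyword] += 1
    # Keep only keywords mentioned by enough users in each segment.
    return {seg: [k for k, n in c.items() if n >= min_count]
            for seg, c in groups.items()}


print(annotate_segments(relevance_queue, segment_starts))
# -> {1: ['Tiger Woods'], 2: []}
```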

Application of User History to Text Search Engine

When users make a bookmark for a specific URL, like www.google.com, they can add their own annotations. Thus, if the text engine maintains a queue for each document/URL, it can collect a large number of users' annotations. Therefore, it can analyze the queue and find the most frequent words, which become new metadata for the document/URL.

In this way, the search engine would continuously have users update and enrich its text databases. This would also help internationalize the process, as users who are not native speakers of the language of a particular web site's content would annotate the content in their own language, helping their countrymen who conduct searches in their native tongue to find the site.

Adaptive Refreshing

The present invention provides a methodology and implementation for adaptive refresh rewinding, as opposed to traditional rewinding, which simply performs a rewind from a particular position by a predetermined length. For simplicity, the exemplary embodiment described below will demonstrate the present invention using video data. Three essential parameters are identified to control the behavior of adaptive refresh rewinding: how far to rewind, how to select the refresh frames in the rewind interval, and how to present the chosen refresh frames on a display device.

Rewind Scope

The scope of rewinding specifies how far to rewind a video back toward the beginning. For example, it is reasonable to rewind to a point 30 seconds before the saved termination position, or to the last scene boundary viewed by the user. Depending on user preference, the rewind scope may be set to a particular value.

Frame Selection

Depending on when the set of refresh frames is determined, the selection can be static or dynamic. A static selection allows the refresh frames to be predetermined at the time of DB population or at the time of saving the termination position, while a dynamic selection determines the refresh frames at the time of the user's request to play back the terminated video.

The candidate frames for user refresh can be selected in many different ways. For example, the frames can be picked out at random or at some fixed interval over the rewind interval. Alternatively, the frames at which a video scene change takes place can be selected.
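
As an illustration, the sketch below combines a fixed rewind scope with the three selection strategies just described. The 30-second default scope, the strategy keywords, and the function name are assumptions for the example, not parameters fixed by the specification.

```python
# Illustrative refresh-frame selection over a rewind interval.
import random


def select_refresh_frames(term_pos_s, scene_changes_s, strategy="scene",
                          scope_s=30.0, count=4, fps=30.0):
    start = max(0.0, term_pos_s - scope_s)          # rewind scope
    if strategy == "random":
        times = sorted(random.uniform(start, term_pos_s)
                       for _ in range(count))
    elif strategy == "fixed":
        step = (term_pos_s - start) / (count + 1)
        times = [start + step * (i + 1) for i in range(count)]
    else:  # "scene": reuse detected scene-change positions
        times = [t for t in scene_changes_s if start <= t <= term_pos_s]
    return [round(t * fps) for t in times]          # frame numbers


# Termination at 600 s; scene changes at known positions (seconds).
print(select_refresh_frames(600.0, [555.0, 572.0, 590.0, 610.0]))
# -> [17160, 17700]
```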

Frame Presentation

Depending on the screen size of the display device, there are two presentation styles: slide show and storyboard. The slide show is good for devices with a small display screen, while the storyboard may be preferred for devices having a large display screen. In the slide show presentation, the frames keep appearing sequentially on the display screen at regular time intervals. In the storyboard presentation, a group of frames is simultaneously placed on the large display panel.

FIG. 55 illustrates an embodiment of the rewind aspect of the present invention. If during playback a video is paused, terminated, or otherwise interrupted, the viewing user or the client system displaying the video preferably sends a request to mark the video at the point of interruption to the server delivering the multimedia content to the client device. As illustrated in FIG. 55, upon receipt of a request to mark, an instant between the beginning 5504 and the end 5518 of the video or multimedia content 5502 is preferably selected as the video's termination or marked position 5514. Then, using the marked position 5514 and metadata associated with the video or multimedia content, the server randomly selects a sequence of refresh frames 5506, 5508, 5510 and 5512 from the rewind interval 5516 for storage on a storage device. When the viewing user or client later initiates playback of the interrupted video, the server first delivers the sequence of refresh frames 5506, 5508, 5510 and 5512 to the client. At the client system, the refresh frames 5506, 5508, 5510 and 5512 are preferably displayed in either a slide-show or storyboard format before the video or multimedia content 5502 resumes playback from the termination or marked position 5514.

FIG. 56 illustrates an alternate embodiment of the rewind aspect of the present invention. In this embodiment, upon interruption of multimedia content 5602, such as a video, having a length from beginning 5604 to end 5608, a request to mark the current location of the video is sent by the client system to the network server. Having preferably run a scene change detection algorithm over the video or multimedia content 5602 at the time of database population, the network server has already retained a list of scene change frames 5610, 5612, 5618, 5620, 5622, 5624, 5628 and 5632. Using the list of scene change frames 5610, 5612, 5618, 5620, 5622, 5624, 5628 and 5632, as well as the information associated with the termination or marked position 5630, the network server is able to determine the sequence of refresh frames 5618, 5620, 5622, 5624 and 5628 over the interval between the viewing termination position 5630 and the beginning position 5614, or alternatively, the rewind interval 5616. Once playback of the video or multimedia content 5602 is restarted, the network server preferably delivers to the client the sequence of selected refresh frames 5618, 5620, 5622, 5624 and 5628. The refresh frames 5618, 5620, 5622, 5624 and 5628 are then preferably displayed by the client in a slide-show or storyboard manner before the video or multimedia content 5602 continues from the termination position 5630.

A third embodiment of the method of the present invention may also be gleaned from FIG. 56. In this embodiment, a request to mark the current location or termination position 5630 of the video is sent to the network server by the client. When playback of the interrupted video or multimedia content 5602 is later requested, the server preferably executes a scene change detection algorithm on the rewind interval 5616, i.e., the segment of multimedia content 5602 between the viewing beginning position 5614 and the termination position 5630. Upon completion of the scene detection algorithm, the network server sends the client system the resulting list of scene boundaries or scene change frames 5618, 5620, 5622, 5624 and 5628, which will serve as refresh frames. Playback of the video or multimedia content 5602 preferably begins upon completion of the client's display of the refresh frames 5618, 5620, 5622, 5624 and 5628.

Illustrated in FIG. 57 is a flow chart depicting a static method of adaptive refresh rewinding implemented on a network server according to teachings of the present invention. Upon initiation at step 5702, method 5700 preferably proceeds to step 5704, where the network server runs a scene detection algorithm on video or other multimedia content to obtain a list of scene boundaries in advance of video or other multimedia content playback.

Upon completion of the scene detection algorithm at step 5704, method 5700 preferably proceeds to step 5706, where a request received from a client system by the network server is evaluated to determine its type. Specifically, step 5706 determines whether the request received by the network server is a video or multimedia content bookmark request or a playback request.

If the request is determined to be a playback request, the playback request is preferably received by the network server at step 5708. At step 5710, the network server then preferably sends the client system a pre-computed list of refresh frames and the previous termination position for the video or multimedia content requested for playback.

Alternatively, if the request is determined to be a video or multimedia content bookmark request at step 5706, method 5700 preferably proceeds to step 5712. At step 5712, a multimedia bookmark, preferably using termination position information received from the client, may be created and saved in persistent storage.

At step 5714, the rewind scope for the bookmark is preferably decided. As mentioned above, the rewind scope generally defines how far to rewind the video or multimedia file back toward its beginning. For example, the rewind scope may be a fixed amount before the termination position or the last scene boundary prior to the termination position. User preferences may also be employed to determine the rewind scope.

Once the rewind scope has been decided at step 5714, method 5700 preferably proceeds to step 5716, where the method of frame selection for determining the refresh scenes to be later displayed at the client system is determined. As mentioned above, refresh frames can be selected in many different ways. For example, refresh frames can be selected randomly, at some fixed interval, or at each scene change. Depending upon user preference settings, or upon other settings, method 5700 may proceed from step 5716 to step 5718, where refresh frames may be selected randomly over the rewind scope. Method 5700 may also proceed from step 5716 to step 5720, where refresh frames may be selected at fixed or regular intervals. Alternatively, method 5700 may proceed from step 5716 to step 5722, where refresh frames are selected based on scene changes. Upon completion of the selection of refresh frames at any of steps 5718, 5720 or 5722, method 5700 preferably returns to step 5706 to await the next request from a client.

Referring now to FIG. 58, a flow chart illustrating a method of adaptive refresh rewinding implemented on a client system according to teachings of the present invention is shown. Upon initiation at step 5802, method 5800 preferably waits at step 5804 for a user request. Upon receipt of a user request, the request is evaluated to determine whether it is a video or multimedia content bookmark request or a video or multimedia content playback request.

If, at step 5804, a video or multimedia content bookmark request is received, method 5800 preferably proceeds to step 5806. At step 5806, a bookmark creation request is preferably sent to a network server configured to use method 5700 of FIG. 57 or method 5900 of FIG. 59. Once the bookmark request has been sent, method 5800 preferably returns to step 5804, where the next user request is awaited.

If, at step 5804, a video or multimedia content playback request is received, method 5800 preferably proceeds to step 5808. At step 5808, the client system sends a playback request to the network server providing the video or multimedia content. After sending the playback request to the network server, method 5800 preferably proceeds to step 5810, where the client system waits to receive the refresh frames from the network server.

Upon receipt of the refresh frames at step 5810, method 5800 preferably proceeds to step 5812, where a determination is made whether to display the refresh frames in a storyboard or a slide show manner. Method 5800 preferably proceeds to step 5814 if a slide show presentation of the refresh frames is to be shown, and to step 5816 if a storyboard presentation of the refresh frames is to be shown. Once the refresh frames have been presented at either step 5814 or 5816, method 5800 preferably proceeds to step 5820.

At step 5820, the client system begins playback of the interrupted video or multimedia content from the previously terminated position (see FIGS. 55 and 56). Once the video or multimedia content has completed playback or is otherwise stopped, method 5800 preferably proceeds to step 5822, where a determination is made whether or not to end the client's connection with the network server. The determination at step 5822 may be made from a user prompt, from user preferences, from server settings, or by other methods. If it is determined at step 5822 that the client connection with the server is to end, method 5800 preferably severs the connection and proceeds to step 5824, where method 5800 ends. Alternatively, if a determination is made at step 5822 that the client connection with the server is to be maintained, method 5800 preferably proceeds to step 5804 to await a user request.

Referring now to FIG. 59, a flow chart illustrating a dynamic method of adaptive refresh rewinding implemented on a network server according to teachings of the present invention is shown. Upon initiation at step 5902, method 5900 preferably proceeds to step 5904, where a request received from a client by the network server is evaluated to determine its type. Specifically, step 5904 determines whether the request received by the network server is a video or multimedia content bookmark request or a playback request.

If, at step 5904, the request is determined to be a video or multimedia content bookmark request, method 5900 preferably proceeds to step 5906. At step 5906, a bookmark, preferably using termination position information received from the client, may be created and saved in persistent storage.

Alternatively, if at step 5904 the request is determined to be a playback request, the playback request is preferably received by the network server at step 5908. In addition, a decision regarding the rewind scope of the playback request is made by the network server at step 5908. Upon completing receipt of the playback request and determining the rewind scope, method 5900 preferably proceeds to step 5910, where the type of refresh frame selection to be made is determined.

At step 5910, the network server determines whether refresh frame selection should be based on randomly selected refresh frames from the rewind scope, refresh frames selected at fixed intervals throughout the rewind scope, or scene boundaries within the rewind scope. If a determination is made that the refresh frames should be selected randomly, method 5900 preferably proceeds to step 5912, where refresh frames are randomly selected from the rewind scope. If, at step 5910, a determination is made that the refresh frames should be selected at fixed or regular intervals over the rewind scope, such selection preferably occurs at step 5914. Alternatively, if the scene boundaries should be used as the refresh frames, method 5900 preferably proceeds to step 5916. At step 5916, the network server preferably runs a scene detection algorithm on the segment of video or multimedia content bounded by the rewind scope to obtain a listing of scene boundaries. Upon completion of the selection of refresh frames at any of steps 5912, 5914 or 5916, method 5900 preferably proceeds to step 5918.

At step 5918, the network server preferably sends the selected refresh frames to the client system. In addition, the network server also preferably sends the client system the previous termination position for the video or multimedia content requested for playback. Once the selected refresh frames and the termination position have been sent to the client system, method 5900 preferably returns to step 5904, where another client request may be awaited.

Storage of User Preferences

The multimedia bookmark of the present invention, in its simplest form, denotes a marked location in a video and consists of positional information (URL, time code), content information (sampled audio, thumbnail image), and some metadata (title, type of content, actors). In general, multimedia bookmarks are created and stored when a user wants to watch the same video again at a later time. Sometimes, however, multimedia bookmarks may be received from friends via e-mail (as described herein) and may be loaded into the receiving user's bookmark folder. If a bookmark so received does not attract the attention of the user, it may be deleted shortly thereafter. With the lapse of time, only the multimedia bookmarks intriguing the user will likely remain in the user's bookmark folder, the remaining bookmarks thereby representing the most valuable information about the user's viewing tastes. Accordingly, one aspect of the present invention provides a method and system embodied in a “recommendation engine” that uses multimedia bookmarks as an input element for the prediction of a user's viewing preferences.

FIG. 49, indicated generally at 4900, illustrates the elements of an embodiment of a multimedia bookmark of the present invention. The multimedia bookmark 4902 contains positional information 4910, preferably consisting of a URL 4912 and a time code 4914. Content information 4920 may also be stored in the multimedia bookmark 4902. Exemplary of the present invention, audio data 4922 and a thumbnail 4924 of the visual information are preferably stored in the content information 4920. Preferably included in the metadata information 4930 of multimedia bookmark 4902 are a genre description 4932, the title 4934 of the associated video, and information regarding one or more actors 4936 featured in the video. Other types of information may also be stored in multimedia bookmark 4902.

Indicated generally at 5000 in FIG. 50 is a block diagram depicting one aspect of the method of the present invention. According to teachings of the present invention, a recommendation engine 5004 may be employed to evaluate a user's multimedia bookmark folder 5002 to determine or predict the user's viewing preferences. Generally, the recommendation engine 5004 is preferably configured to read any positional, content and/or metadata information contained in any of the multimedia bookmarks 5006, 5008 and 5010 maintained in the user's multimedia bookmark folder 5002.

In one embodiment, the recommendation engine 5004 periodically visits the user's multimedia bookmark folder 5002 and performs a statistical analysis upon the multimedia bookmarks 5006, 5008 and 5010 maintained therein. For example, assume that a user has 10 multimedia bookmarks in his multimedia bookmark folder. Further assume that five of the bookmarks are captured from sports programs, three are captured from science fiction programs, and two are captured from situation comedy programs. As the recommendation engine 5004 examines the “genre” attribute contained in the metadata of each multimedia bookmark, it preferably counts the number of specific keywords and infers that this user's favorite genre is sports, followed by science fiction and situation comedy. Over time, as the user saves additional multimedia bookmarks, the recommendation engine 5004 becomes better able to identify the user's viewing preferences. As a result, whenever the user wishes to view a program, the recommendation engine can use its predictive capabilities to guide the user through a multitude of program channels by automatically bringing together the user's preferred programs. The recommendation engine 5004 may also be configured to perform similar analyses on such metadata information as the “actors,” “title,” etc.
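
A minimal sketch of this genre-counting analysis follows; the bookmark representation, attribute names, and sample data are illustrative assumptions.

```python
# Illustrative recommendation-engine pass: rank genres by how often
# they appear in the metadata of a user's saved multimedia bookmarks.
from collections import Counter

bookmarks = [  # hypothetical bookmark-folder contents
    {"title": "Masters Highlights", "genre": "sports"},
    {"title": "Night Game",         "genre": "sports"},
    {"title": "Star Voyage",        "genre": "science fiction"},
    {"title": "Laugh Track",        "genre": "situation comedy"},
    {"title": "Final Round",        "genre": "sports"},
]


def rank_preferences(folder, attribute="genre"):
    counts = Counter(b[attribute] for b in folder if attribute in b)
    return counts.most_common()    # most frequent value first


print(rank_preferences(bookmarks))
# -> [('sports', 3), ('science fiction', 1), ('situation comedy', 1)]
```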

Illustrated in FIG. 51, indicated generally at 5100, is a block diagram incorporating one or more EPG channel streams 5104 with teachings of the present invention. Upon receipt, by the multimedia bookmark process 5106, of a user request for creation of a multimedia bookmark, the preferred information to be associated with the multimedia bookmark, i.e., the positional, content and metadata information illustrated in FIG. 49, is preferably gathered. While aspects of the positional information, i.e., the desired URL and time code information, used in the multimedia bookmark, as well as the content information, i.e., a desired audio segment and thumbnail image, may be gathered directly from the video's source, the metadata will likely have to be found elsewhere. Accordingly, in the embodiment illustrated in FIG. 51, the metadata (genre, title, actors) information sought by the multimedia bookmark process 5106 may be obtained from the EPG channel 5102 via the EPG channel stream 5104. This metadata is the source of information used by the recommendation engine of the present invention to examine the user's viewing preferences. After extracting the metadata from the EPG channel stream 5104, the multimedia bookmark process 5106 creates a new multimedia bookmark and places the multimedia bookmark into the user's multimedia bookmark folder on the user's storage device 5108.

Illustrated in FIG. 52 is a block diagram of a system incorporating teachings of the present invention without an EPG channel. Upon receipt, by the multimedia bookmark process 5206, of a user request to create a multimedia bookmark, the preferred information to be associated with the multimedia bookmark, i.e., the positional, content and metadata information illustrated in FIG. 49, is preferably gathered. Again, the positional and content information to be included in the multimedia bookmark may be readily obtained from the video's source. However, to obtain the desired metadata, the multimedia bookmark process 5206 preferably accesses network 5202 via two-way communication medium 5204 to establish a communication link with metadata server 5210. Preferably located on metadata server 5210 is such metadata as genre, title, actors, etc. Once a communication link is established between the multimedia bookmark process 5206 and the metadata server 5210, the multimedia bookmark process 5206 may download or otherwise obtain the metadata information it prefers for inclusion in the multimedia bookmark. After the desired metadata has been obtained by the multimedia bookmark process 5206, the user's multimedia bookmark is preferably placed in the user's multimedia bookmark folder on the user's storage device 5208.

MetaSync First Embodiment

FIG. 68 shows the system that implements the present invention for a set top box (“STB”) with personal video recorder (“PVR”) functionality. In this embodiment 6800 of the present invention, the metadata agent 6806 receives metadata for the video content of interest from a remote metadata server 6802 via the network 6804. For example, a user could provide the STB with a command to record a TV program beginning at 10:30 PM and ending at 11:00 PM. The TV signal 6816 is received by the tuner 6814 of the STB 6820. The incoming TV signal 6816 is processed by the tuner 6814 and then digitized by the MPEG encoder 6812 for storage of the video stream in the storage device 6810. Metadata received by the metadata agent 6806 can be stored in a metadata database 6808, or in the same data storage device 6810 that contains the video streams. The user could also indicate a desire to interactively browse the recorded video. Assume further that, due to emergency news or some technical difficulty, the broadcasting station sends the program out on the air from 10:45 PM to 11:15 PM.

In accordance with the user's directions, the PVR on the STB starts recording the broadcast TV program at 10:30 sharp. In addition to the recording, since the user also wants to browse the video, the STB needs the metadata for browsing the program. An example of such metadata is shown in Table 4. Unfortunately, it is not easy to automatically generate the metadata on the STB if it has only limited processing (CPU) capability. Thus, the metadata agent 6806 requests from a remote metadata server 6802 the metadata needed for browsing the video specified by the user. Upon the request, the corresponding metadata is delivered to the STB 6820 transparently to the user.

The delivered metadata might include a set of time codes/frame numbers pointing to the segments of the video content of interest. Since these time codes are defined relative to the start of the video used to generate the metadata, they are meaningful only when the start of the recorded video matches that of the video used for the metadata. However, in this scenario, there is a 15-minute time difference between the recorded content on the STB 6820 and the content on the metadata server 6802. Therefore, the received metadata cannot be directly applied to the recorded content without proper adjustments. The detailed procedure to resolve this mismatch is described in the next section.

MetaSync Second Embodiment

FIG. 69 shows the system 6900 that implements the present invention when an STB 6930 with PVR is connected to an analog video cassette recorder (VCR) 6920. In this case, everything is the same as in the previous embodiment, except for the source of the video stream. Specifically, the metadata server 6902 interacts with the metadata agent 6906 via the network 6904. The metadata received by the metadata agent 6906 (and, optionally, any instructions stored by the user) are stored in the metadata database 6908 or the video stream storage device 6910. The analog VCR 6920 provides an analog video signal 6916 to the MPEG encoder 6912 of the STB 6930. As before, the digitized video stream is stored by the MPEG encoder 6912 in the video stream storage device 6910.

From a business point of view, this embodiment might be an excellent model for reusing the content stored on conventional videotapes for enhanced interactive video service. This model is beneficial to both consumers and content providers: unless consumers want very high quality video compared to the VHS format, they can reuse content that they have already paid for, while content providers can charge consumers a nominal cost for the metadata download.

Video Synchronization with the Metadata Delivered

Forward Collation

Video synchronization is necessary when a TV program is broadcast behind schedule (noted above and illustrated in FIG. 70). Starting from the beginning 7024 of the recorded video stream A′ (7020) of interest in the STB, forward collation matches the reference frames/segment A1 (7004), which is delivered from the server, against all the frames on the STB and finds the most similar frames/segment A1′ (7024). As a result of this matching, the temporal media offset value d (7010) is determined, which implies that each representative frame number (or time code) received from the server for metadata services must be increased by the offset d (7010). In this way, the downloaded metadata is synchronized with the video stream encoded in the STB. As illustrated in FIG. 70, the use of the offset 7010 enables correlation of frames A1 (7004) to A1′ (7024), A2 (7006) to A2′ (7026), and A3 (7008) to A3′ (7028).
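
The collation and offset-application steps can be sketched as follows. This is a minimal illustration assuming frames are represented by small feature vectors compared with a squared-distance measure; the real system may compare color histograms, texture, shape, or audio features, and the function names are assumptions.

```python
# Illustrative forward collation: slide a reference-frame signature over
# the recorded stream's frame signatures to find the temporal offset d.

def frame_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def find_offset(reference_sig, recorded_sigs):
    """Return the index of the recorded frame most similar to the
    reference frame; this index is the offset d to add to every
    time code delivered with the metadata."""
    return min(range(len(recorded_sigs)),
               key=lambda i: frame_distance(reference_sig, recorded_sigs[i]))


def synchronize(metadata_frames, d):
    return [f + d for f in metadata_frames]   # shift all time codes by d


# Toy signatures: the reference frame matches recording index 2.
ref = [0.9, 0.1, 0.4]
rec = [[0.0, 0.0, 0.0], [0.5, 0.5, 0.5], [0.9, 0.1, 0.4], [0.2, 0.8, 0.3]]
d = find_offset(ref, rec)
print(d, synchronize([10, 250, 980], d))   # -> 2 [12, 252, 982]
```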

For the synchronization, the server can send the STB characteristic data other than image data to represent the reference frame or segment. The important thing is to send the STB a characteristic set of data that uniquely represents the content of the reference frame or segment for the video under consideration. Such data can include audio data and image data, such as a color histogram, texture, and shape, as well as sampled pixels. This synchronization generally works for both analog and digital broadcasting of programs, since the content information is utilized.

In the case when the broadcast TV program to be recorded is in the form of a digital video stream, such as MPEG-2, and the downloaded metadata was generated with reference to the same digital stream, information such as the PTS (presentation time stamp) present in the packet header can be utilized for synchronization. This information is needed especially when the program is recorded from the middle of the program or when the recording stops before the end of the program. Since the first and last PTSs of the full broadcast are then not both available in the STB, it is difficult to compute the media time code with respect to the start of the broadcast program unless such information is periodically broadcast with the program. In this case, if the first and the last PTSs of the digital video stream are delivered to the STB with the metadata from the server, the STB can synchronize the time code of the recorded program with respect to the time code used in the metadata by computing the difference between the first and last PTS, since the video stream of the broadcast program is assumed to be identical to that used to generate the metadata.
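
For illustration, MPEG-2 presentation time stamps run on a 90 kHz clock, so a media time code relative to the program start can be sketched as below. The delivery of the program's first PTS with the metadata is as described above; the function name and sample values are assumptions.

```python
# Illustrative PTS-based synchronization (MPEG-2 PTS uses a 90 kHz clock).
PTS_CLOCK_HZ = 90_000


def media_time_seconds(pts: int, program_first_pts: int) -> float:
    """Time code of a recorded frame relative to the broadcast program's
    start, given the program's first PTS delivered with the metadata."""
    return (pts - program_first_pts) / PTS_CLOCK_HZ


# Recording began mid-program: its first packet carries PTS 8_100_000,
# while the metadata says the program started at PTS 900_000.
print(media_time_seconds(8_100_000, 900_000))   # -> 80.0 seconds
```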

Backward Collation

A backward collation is needed when a TV program (7102) is broadcast ahead of schedule, as illustrated in FIG. 71. Starting from the end of the recorded video stream A′ (7122) in the STB, the backward collation matches the reference frame A1 (7104) from the metadata server against all the frames on the STB and finds the frame A1′ (7124) most similar to the reference frame A1 (7104). As a result of this matching, the offset value d (7110) is determined, which implies that each representative frame number or time code received from the server must be decreased by the offset d (7110) to obtain, for example, the correlation of frames A2 (7106) with A2′ (7126) and A3 (7108) with A3′ (7128), as illustrated in FIG. 71.

Detection of Commercial Clip

In this scenario, the user has set a flag instructing the STB to ignore commercials that are embedded in the video stream. For this scenario, assume that the metadata server knows which advertisement clip is inserted in the regular TV program, but does not know the exact temporal position of the inserted clip. Assume further that the frame P (7212) is the first frame of the advertisement clip Sc (7230), the frame Q (7212) is the last frame of Sc (7230), the temporal length of the clip Sc is d_C (7236), and the total temporal length of the TV program (video stream A 7202) is d_T (7204), as illustrated in FIG. 72.

i) Forward Detection of Advertisement Segment

Given the reference frame P (7212), and examining the frames from the beginning to the end of the recorded video stream A′ (7222), the frame P′ (7232) most similar to the reference frame P (7212) is identified by using an image matching technique, and the temporal distance h1 (7224) from the start frame (7223) to the frame P′ (7232) is computed. Then, for each received representative frame whose frame number (or time code) is greater than h1 (7224), the value d_C (7236) is added.

ii) Backward Detection of Advertisement Segment

Given the reference frame Q (7212), and examining the frames from the end to the beginning of the recorded video stream A′ (7222), the frame Q′ (7234) most similar to the reference frame Q (7212) is found, and the temporal distance h2 (7226) from the end frame (7227) to the frame Q′ (7234) is computed. Then, each received representative frame whose frame number (or time code) is greater than d_T − (h2 + d_C) is adjusted by adding d_C (7236).
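
The two adjustment rules just described amount to shifting every metadata time code that falls after the detected advertisement clip by its length d_C. A minimal sketch, with assumed frame-number units and function names, follows.

```python
# Illustrative time-code adjustment around a detected commercial clip of
# length d_c inserted into the recorded stream.

def adjust_forward(metadata_frames, h1, d_c):
    """Forward detection: frames after the clip's detected start (h1
    frames into the recording) are shifted by the clip length d_c."""
    return [f + d_c if f > h1 else f for f in metadata_frames]


def adjust_backward(metadata_frames, d_t, h2, d_c):
    """Backward detection: frames after d_T - (h2 + d_c) are shifted."""
    threshold = d_t - (h2 + d_c)
    return [f + d_c if f > threshold else f for f in metadata_frames]


# A 900-frame ad detected 3000 frames into a 54000-frame recording.
print(adjust_forward([1200, 2800, 4100, 9000], h1=3000, d_c=900))
# -> [1200, 2800, 5000, 9900]
```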

Detection of Individual Program Segments from a Composite Video File

This case takes place when a user issues a request to record multiple programs into a single video stream in sequential order, as shown in FIG. 73. For a given reference frame, this procedure computes the frame (or time code) offset from the first frame of the video stream up to the frame which is most similar to the reference frame. For example, assume there are three reference start frames A1 (7304), B1 (7314), and C1 (7324), and end frames 7306, 7316, and 7326, that are selected from videos A 7302, B 7312, and C 7322, respectively. For the reference frame A1 (7304), moving in the direction from the beginning to the end of the video stream 7303, the procedure matches the frame A1 (7304) against all the frames on the stream 7303 and finds the most similar frame A1′ (7344). The offset “offA” (7348) from the beginning 7305 to the location of A1′ (7344) is then computed. This process is repeated in the same manner for the other reference frames B1 (7314) and C1 (7324) for video streams 7312 and 7322, respectively. That is, the procedure finds the most similar frames B1′ (7354) and C1′ (7364) of the video streams 7352 and 7362, respectively, and then computes the offset for the frame B1′ (7354), which is “offB” (7358), followed by the offset for the frame C1′ (7364), which is “offC” (7368), from the beginning 7305. This enables calculation of the end frames 7356 and 7366 of video streams 7352 and 7362, respectively. In this way, a user can access the exact start and end positions of each program.

TABLE 4. An example of metadata for video browsing in XML Schema

  <?xml version="1.0" encoding="EUC-KR"?>
  <Mpeg7 xmlns="http://www.mpeg7.org/2001/MPEG-7_Schema"
      xmlns:xsi="http://www.w3c.org/1999/XMLSchema-instance"
      xml:lang="en" type="complete">
    <ContentDescription xsi:type="SummaryDescriptionType">
      <Summarization>
        <Summary xsi:type="HierarchicalSummaryType" components="keyVideoClips"
            hierarchy="independent">
          <SourceLocator>
            <MediaUri>mms://www.server.com/news.asf</MediaUri>
          </SourceLocator>
          <HighlightSummary level="0" duration="00:01:35:04">
            <Name>Top Stories</Name>
            <HighlightSegment>
              <KeyVideoClip>
                <MediaTime>
                  <MediaTimePoint>00:09:05:22</MediaTimePoint>
                  <MediaDuration>00:00:24:28</MediaDuration>
                </MediaTime>
              </KeyVideoClip>
              <KeyFrame><MediaUri>16354.jpg</MediaUri></KeyFrame>
            </HighlightSegment>
            <HighlightChild level="1" duration="00:00:24:28">
              <Name>Wrestler Hogan</Name>
              <HighlightSegment>
                <KeyVideoClip>
                  <MediaTime>
                    <MediaTimePoint>00:09:05:22</MediaTimePoint>
                    <MediaDuration>00:00:24:28</MediaDuration>
                  </MediaTime>
                </KeyVideoClip>
                <KeyFrame><MediaUri>16354.jpg</MediaUri></KeyFrame>
              </HighlightSegment>
            </HighlightChild>
            <HighlightChild level="1" duration="00:00:35:21">
              <Name>Gun Shoots in Colorado</Name>
              <HighlightSegment>
                <KeyVideoClip>
                  <MediaTime>
                    <MediaTimePoint>00:09:30:20</MediaTimePoint>
                    <MediaDuration>00:00:35:21</MediaDuration>
                  </MediaTime>
                </KeyVideoClip>
                <KeyFrame><MediaUri>17096.jpg</MediaUri></KeyFrame>
              </HighlightSegment>
            </HighlightChild>
            <HighlightChild level="1" duration="00:00:34:15">
              <Name>Women Wages</Name>
              <HighlightSegment>
                <KeyVideoClip>
                  <MediaTime>
                    <MediaTimePoint>00:10:06:11</MediaTimePoint>
                    <MediaDuration>00:00:34:15</MediaDuration>
                  </MediaTime>
                </KeyVideoClip>
                <KeyFrame><MediaUri>18171.jpg</MediaUri></KeyFrame>
              </HighlightSegment>
            </HighlightChild>
          </HighlightSummary>
        </Summary>
      </Summarization>
    </ContentDescription>
  </Mpeg7>

Automatic Labeling of Captured Video With Text from EPG

Imagine that a show program from cable TV is stored on a user's hard disk using a PVR. If the user wants to browse the video, he would need some metadata for it. One of the most convenient ways to get the metadata about the show is to use the information from the EPG stream. Thus, if one could grab the EPG data, one could perform some level of automatic authoring and associate at least the title, date, show time, and other metadata with the video.

E-mail Attachments

Users often forget to attach documents when they send e-mail. A solution to that problem is to analyze the e-mail content and give a message to the user asking whether he or she indeed attached the document. For example, if the user sets an option flag on his e-mail client software program that is equipped with the present invention, a small program or other software routine then analyzes the e-mail content in order to determine whether there is a possibility or likelihood of an attachment being referenced by the user. If so, a check is made to determine if the draft e-mail message has an attachment. If there is no attachment, a reminder message is issued to the user inquiring about the apparent need for an attachment.

An example of the method of content analysis of the present invention includes:

1. Matching the words in the e-mail text by scanning the e-mail contents for words like “enclose” or “attach,” or their equivalents in other languages, preferably the language setting designated by the user.
2. If one of the keywords is present, determining if the e-mail has at least one attachment.
3. If no attachment exists and a keyword was found, issuing a reminder message to the user regarding the need for an attachment.
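
A minimal sketch of these three steps follows; the keyword list, function name, and reminder text are assumptions for the example, and a real client would hook such a check into its send action.

```python
# Illustrative attachment-reminder check.
import re

ATTACH_KEYWORDS = ("attach", "attached", "attachment", "enclose", "enclosed")


def needs_attachment_reminder(body: str, attachments: list) -> bool:
    """True when the body mentions an attachment but none is present."""
    words = re.findall(r"[a-z]+", body.lower())
    mentions = any(w in ATTACH_KEYWORDS for w in words)
    return mentions and not attachments


draft = "Please see the attached quarterly report."
if needs_attachment_reminder(draft, attachments=[]):
    print("Reminder: your message mentions an attachment, "
          "but none is attached.")
```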

User Interface for Showing Relative Position

Reference is made to FIGS. 61 and 62, which illustrate portions of the highlights of the Masters tournament of 1997. Specifically, FIG. 61 shows a browser window 6102 having a Web page 6104 and a remote control button bar 6106 along the bottom of the window 6102. On the web page 6104 are various hyperlinks and references to portions of video: the third round 6120, the fourth round 6122, Tiger Woods' biography 6124, and the ending narration 6126. The remote control buttons have various functionality; for example, there is a program list button 6108, a browsing button 6110, a play button 6112, and a story board button 6116. In the center of the buttons is a multifunction button 6114 that can be enabled with various functionality for moving among various selections within a web page. This is particularly useful if the page contains a number of thumbnail images in a tabular format.

FIG. 62 contains a drill-down from one of the video links in FIG. 61. Specifically, in FIG. 62 there is the standard web browsing window 6202 with the web page 6204 and the button control bar 6206. The remote control button bar 6206 has functionality identical to that described in FIG. 61; for example, there is a program list button 6208, a browsing button 6210, a play button 6212, and a story board button 6216. As illustrated in FIG. 62, the selected image from FIG. 61, namely 6120, appears in FIG. 62 again as element 6120. The corresponding video portion of Tiger Woods' play on the ninth hole is element 6220, and the web page illustrates several other video clips, namely the play to the 18th hole 6232 and the interview with players 6234.

FIG. 60 illustrates a hierarchical navigation scheme of the present invention as it relates to FIGS. 61 and 62. This hierarchical tree is typically utilized as a semantic representation of video content. Specifically, there is the whole video 6002 that contains all the video segments, which compose a single hierarchical tree. Subsets of the video segments are shown in video clip 6004: the third round 6020, the fourth round 6022, Tiger Woods' biography 6024, and the ending narration 6026, which correspond to elements 6120, 6122, 6124, and 6126, respectively, of FIG. 61. The lower three boxes of FIG. 60 correspond to the three choices available, as illustrated in FIG. 62, namely, Tiger Woods' first nine holes 6021, which corresponds to element 6220 of FIG. 62, as well as Tiger Woods' second nine holes 6032 and the interview 6034, which correspond to the remaining two elements illustrated in FIG. 62. As shown in FIG. 60, the hierarchical navigation scheme allows a user to quickly drill down to the desired web page without having to wait for the rendering of multiple interceding web pages. The hierarchical status bar, using different colors, can be used to show the relative position of the segment currently selected by the user.

Referring back to FIG. 61, FIG. 61 further contains a status bar 6150 that shows the relative position 6152 of the selected video segment 6120. Similarly, in FIG. 62, the status bar 6250 illustrates the relative position of the video segment 6120 as portion 6252, and the sub-portion of the video segment 6120, i.e., 6254, that corresponds to Tiger Woods' play to the 18th hole 6232.

Optionally, the status bar 6150, 6250 can be mapped such that a user can click on any portion of the mapped status bar to bring up web pages showing thumbnails of selectable video segments within the hierarchy. That is, if the user had clicked on a portion of the map corresponding to element 6254, the user would be given a web page containing the starting thumbnail of Tiger Woods' play to the 18th hole, as well as Tiger Woods' play to the ninth hole, as well as the initial thumbnail for the highlights of the Masters tournament; in essence, this gives a quick map of the branch of the hierarchical tree from the position on which the user clicked on the map status bar.

Alternate Embodiments

Preferably, the video files are stored in each user's storage devices, such as a hard disk on a personal computer (PC), that are themselves connected to a P2P server so that those files can be downloaded by other users who are interested in watching them. In this case, if a user A makes a multimedia bookmark on a video file stored in his/her local storage and sends the multimedia bookmark via e-mail to a user B, the user B cannot play the video starting from the position pointed to by the bookmark unless the user B downloads the entire video file from user A's storage device. Depending upon the size of the video file and the bandwidth available, the full download could take a considerable length of time. The present invention solves this problem by sending the multimedia bookmark as well as a part of the video as follows:

1) The user A sends a summary of the video, generated manually, automatically by video analysis, or semiautomatically. The summary could be a set of key frames representing the whole video, where one of the key frames is the bookmarked frame, which is highlighted.
2) The user A then sends a short video clip file near the bookmarked position. The video clip file can be generated by editing the video file, such as an MPEG-2 file, among others.

Thus, the user B can decide whether he/she wants to download the whole video after watching the part of the video containing the bookmarked position. By use of the present invention, bandwidth can be saved that would otherwise have been devoted to downloading whole video files in which user B would not have sufficient interest to justify the download. (A sketch of the data sent in these two steps follows.)
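As a hedged illustration, the data sent from user A to user B in these two steps might be structured as follows; all field names and types here are assumptions made for the sketch, not a required wire format.

    from dataclasses import dataclass, field
    from typing import List

    @dataclass
    class KeyFrame:
        image_uri: str              # e.g., a JPEG thumbnail of the frame
        time_code: str              # position of the frame in the source video
        highlighted: bool = False   # True for the bookmarked frame

    @dataclass
    class BookmarkPackage:
        video_uri: str              # location of the full video on user A's storage
        bookmark_position: str      # time code pointed to by the multimedia bookmark
        summary: List[KeyFrame] = field(default_factory=list)  # step 1: video summary
        clip_uri: str = ""          # step 2: short clip near the bookmarked position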

Yet another embodiment of the present invention addresses a problem with broadcast video: the user cannot make a bookmark of his/her favorite segment once the segment has passed and a new scene appears in its place in the video. One solution is to use the time-shifting property of the digital personal video recorder (PVR). Thus, as long as a certain amount of the video prior to the part currently being played is always recorded by the PVR and stored in temporary (or permanent non-volatile) storage, the user can always go back to his/her favorite position in the video.

Alternatively, suppose that the user A sends a bookmark to the user B as described above. A problem still arises if the video is broadcast without video-on-demand functionality. In this case, when the smart set-top box (STB) of the user B receives a bookmark, the STB can check the electronic programming guide (EPG) to see if the same program is scheduled to be broadcast sometime in the future. If so, the STB can automatically record the same program at the scheduled time, and then the user B can play the bookmarked video.

2. Search

An embodiment of the present invention is based on the observation that perceptually relevant images often do not share any apparent low-level features but still appear conceptually and contextually similar to humans. For instance, photographs that show people in swimsuits may be drastically inconsistent in terms of shape, color, and texture but conceptually look alike to humans. In contrast to the methodologies mentioned above, the present invention does not rely on low-level image features, except in an initialization stage, but mostly on the perceptual links between images that are established by many human users over time. While it is unfeasible to manually provide links between a huge number of images at once, the present invention is based on the notion that a large number of users over a considerable period of time can build a network of meaningful image links. The method of the present invention is a scheme that accumulates information provided by human interaction in a simpler way than image feature-based relevance feedback and utilizes the information for perceptually meaningful image retrieval. It is independent of and complementary to image search methods that use low-level features and therefore can be used in conjunction with them.

This embodiment of the method of the present invention is a set of algorithms and data structures for organizing and accumulating users' experience in order to build image links and to retrieve conceptually relevant images. A small amount of extra data space, a queue of image links, is needed for each query image in order to document prior browsing and searching. Based on this queue of image links, a graph data structure with image objects and image links is formed, and the constructed graph can be used to search and cluster perceptually relevant images effectively. The next section describes the underlying mathematical model for accumulating users' browsing and searching based on image links. The subsequent section presents the algorithm for the construction of the perceptual relevance graph and for searching.

Information Accumulation Using Image Links

Data Structure for Collecting Relevance Information

There are potentially many ways of accumulating information about users' prior feedback. The present invention utilizes the concept of collecting and propagating perceptual relevance information using simple data structures and algorithms. The relevance information provided by users can be based on image content, concept, or both. To store an image's links to other images to which some relevance has been established, each image has a queue of finite length, as illustrated in FIG. 30. This is called the “relevance queue.” The relevance queue 3006 can be initially empty or filled with links to computationally similar images (CSIs) determined by low-level image feature descriptors, such as the color, shape, and texture descriptors that are commonly used in a conventional content-based image search engine.

A perceptually relevant image (PRI) is determined by a user's selection in a manner that is similar to that of general relevance feedback schemes. When the image of interest is presented as a query and initial image retrieval is performed, the user views the retrieved images and establishes relevance by clicking perceptually related images as positive examples. FIG. 30 illustrates the case of Image 5 3004 of the retrieved images 3002 being clicked and its link being enqueued 3010 into the relevance queue Q_(n) 3006 of the query Image n 3008. In contrast to previous relevance feedback schemes, where the positive examples are used for adjusting low-level feature weights or distances, the method of the present invention inserts the link to the clicked image, the PRI, into the query image's relevance queue by the normal “enqueue” operation 3010. The oldest image link is deleted from the queue in a dequeue operation 3012. The list of PRIs for each image queue is updated dynamically whenever a link is made to the image by a user's relevance feedback, and thus an initially small set of links will grow over time. The frequency at which a PRI appears in the queue is the frequency of the users' selection and can be taken as the degree of relevance. This data structure, which comprises image data and image links, will become the basic vertex and edge structures, respectively, in the relevance graph that is developed for image searching, and the frequency of the PRI will be used for determining edge weights in the graph.
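A minimal sketch of the relevance queue follows, assuming Python's collections.deque for the fixed-length queue; the class and method names are illustrative.

    from collections import deque, Counter

    class RelevanceQueue:
        # Fixed-length queue of image links for one query image.
        def __init__(self, length, initial_links=()):
            # May start empty or pre-filled with links to computationally
            # similar images (CSIs) from a conventional search engine.
            self.queue = deque(initial_links, maxlen=length)

        def add_link(self, image_id):
            # The enqueue operation for a clicked PRI; once the queue is
            # full, deque discards the oldest link (the dequeue operation).
            self.queue.append(image_id)

        def frequencies(self):
            # Frequency of each PRI in the queue, usable as the degree of
            # relevance and later as an edge weight in the relevance graph.
            return Counter(self.queue)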

Conventional relevance feedback methods explicitly require users to select positive or negative examples and may further require imposing weighting factors on selected images. In this embodiment of the present invention, users are not explicitly instructed to click similar images. Instead, the user simply browses and searches images motivated only by their interest. During the users' browsing and searching, it is expected that they are likely to click more often on relevant images than on irrelevant images, so the relevance information is likewise accumulated in the relevance queues.

Mathematical Model for Information Accumulation

It is conceivable to develop a sophisticated update scheme that minimizes the variability of users' expertise, experience, goodwill, and other psychological effects. In the present invention, however, only the basic framework for PRI links, without psychology-based user modeling, is presented. The assumption is that there are more users with good intentions than others, and in this case, it is shown in the experimental studies that the effect of sporadic false links to irrelevant images is minimized over time for the proposed scheme.

The structure of the image queue as defined above affords many different interpretations. The entire queue structure, one queue for each image in the database, may be viewed as a state vector that gets updated after each user interaction, namely by the enqueue and dequeue operations. If all images in the database are labeled by the image indices 1 through N, where N is the total number of images, the content of the queue may be represented by the queue matrix Q=[Q₁| . . . |Q_(N)] of size N_(Q)×N, where N_(Q) is the length of the image queue. The nth column of the queue matrix, Q_(n), contains the image indices as its elements, and they may be initialized according to some low-level image relevance criteria.

When a user searches (queries) the database using the nth image, the system will return a list of similar images on the display window. Suppose the user then clicks the image with index m. This results in updating the nth column Q_(n) of the queue matrix corresponding to the enqueue and dequeue operations. This can simply be modeled by the following update equation for the kth element of Q_(n):

$Q_{n}(k) = \begin{cases} m, & k = 1 \\ Q_{n}(k-1), & k = 2, \ldots, N_{Q} \end{cases}$

The queue matrix defined as such immediately allows the following definition of the state vector.

The state vector representing the image queue is defined by an N×N matrix S=[S₁| . . . |S_(N)] whose nth column S_(n) is an N×1 vector which basically represents the image queue for the nth image in the database. The jth element of S_(n) is defined to be:

$S_{n}(j) = \sum_{\{\, i \,:\, Q_{n}(i) = j \,\}} \alpha (1 - \alpha)^{i-1},$

where 0<α<1. Note that if the weighting α(1−α)^(i−1) inside the summation were 1, then S_(n) would simply be the histogram of image indices of the nth image queue, Q_(n). Thus, S_(n) as defined above is basically a weighted histogram of image indices of the nth image queue Q_(n). The weight α serves as the forgetting factor. Note that for an infinite queue (N_(Q)=∞), S_(n) is a valid probability mass function, as ΣS_(n)(j)=1 and S_(n)(j)≧0. Even for a finite queue, for instance with N_(Q)=256 and the forgetting factor α=0.1, the sum ΣS_(n)(j)≈1−2×10⁻¹². With the above relationship between the queue content and the state vector, the evolution of the state at time p may be described by the following update equation:

$S_{n}^{(p+1)} = (1 - \alpha) S_{n}^{(p)} + \alpha e_{EQ}^{(p)},$

where e_(EQ)^((p)) is a natural basis vector whose elements are all zero except for the one whose row index is identical to the index of the image currently being enqueued. What one would like, of course, is for this state vector to approach a state that makes sense for the current database content. Given a database of N images, assume that there exists a unique N×N image relevance matrix R=[R₁| . . . |R_(N)]. The matrix is composed of elements r_(mn), the relevance values, each of which is in essence the probability of a viewer clicking the mth image while searching (querying) for images similar to the nth image. The actual values in the relevance matrix R will necessarily be different for different individuals. However, when all users are viewed as a collective whole, the assumption of the existence of a unique R becomes rather natural. The state update equation, during steady-state operation, may be expressed by the expectation operation:

$E[S_{n}^{(\infty)}] = E[e_{EQ}^{(\infty)}] = R_{n} = \text{the nth column of } R.$
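As the text notes below, the state vector can be computed on demand from the queue contents alone. The following sketch computes S_n under the convention above (position 1 holds the most recent link, α is the forgetting factor); the function name is illustrative.

    def state_vector(queue, alpha, num_images):
        # queue[0] is taken to be the most recently enqueued link,
        # matching Q_n(1) in the text; 0 < alpha < 1.
        s = [0.0] * num_images
        for i, image_id in enumerate(queue, start=1):
            s[image_id] += alpha * (1.0 - alpha) ** (i - 1)
        return s

With the RelevanceQueue sketch above, whose newest link sits at the right end of the deque, one would call state_vector(list(reversed(rq.queue)), 0.1, N).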

The above equality expresses precisely the desired result. That is, the state vector (matrix) S converges to the image relevance matrix R, provided that an image relevance matrix exists. Although the discussion of the state vector is helpful in identifying the state to which it converges, the actual construction and update of the state vector are not necessary. As the image queue has all the information needed to compute the state vector (or the image relevance values), the implementation requires only the image queue itself. The current state vector is computed as required; in particular, it is during the image retrieval process that the forgetting factor α is needed to return images similar to the query image based on the current image relevance values.

Relevance Queue Initialization

The discussion in the previous subsection assumes a steady state of the relevance queue. When a new image is inserted into a database, it does not have any links to PRIs, and no images can be presented to a user to click. The relevance queue is therefore initialized with CSIs obtained with a conventional search engine, in a manner that gives higher-ranked CSIs higher relevance values. In the initialization stage, CSI links are put into the relevance queue evenly, but higher-ranked CSI links more frequently. An initialization method is illustrated for eight retrieved CSIs 3102 in the relevance queue 3106 in FIG. 31, where the image link numbers denote the ranks of the retrieved CSIs. This technique ensures that higher-ranked CSIs will remain longer in the queue as users replace CSIs with PRIs through relevance feedback.
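One plausible reading of the FIG. 31 scheme, sketched here as an assumption rather than the required method, gives rank r of k retrieved CSIs a repetition weight of k−r+1 and interleaves the copies in round-robin fashion so the links are spread evenly through the queue:

    def initialize_queue(csi_ranked, queue_length):
        # csi_ranked[0] is the highest-ranked CSI link.
        if not csi_ranked:
            return []
        k = len(csi_ranked)
        weights = [k - r + 1 for r in range(1, k + 1)]
        out, remaining = [], list(weights)
        while len(out) < queue_length:
            # One round-robin pass: every CSI with remaining budget
            # contributes one copy, so copies are spread evenly.
            for idx, image_id in enumerate(csi_ranked):
                if remaining[idx] > 0 and len(out) < queue_length:
                    out.append(image_id)
                    remaining[idx] -= 1
            if not any(remaining):
                remaining = list(weights)   # repeat the pattern if needed
        return out

Because higher ranks carry larger budgets, their links occur more often and therefore survive longer as relevance feedback replaces CSIs with PRIs.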

Construction of Relevance Graph and Image Search

Construction of Relevance Graph

A graph is a natural model for representing syntactic and semantic relationships among multimedia data objects. Weighted graphs are used by the present invention to represent relevance relationships between images in an image database. As shown in FIG. 47, the vertices 4706 of the graph 4702 represent the images, and the edges 4708 are made from the image links in the image queue.

An edge between two image vertices P_(n) and P_(j) is established if image P_(j) is selected by users when P_(n) is used as a query image, and therefore image P_(j) appears a certain number of times in the image link queue of P_(n). The edge cost is determined by the frequency of image P_(j) in the image link queue of P_(n), i.e., the degree of relevance established by users. Among many potential cost functions, the following function is used:

$\mathrm{Cost}(n, j) = \mathrm{Thr}[\, 1 - S_{n}(j) \,],$

where the threshold function is defined as:

$\mathrm{Thr}[X] = \begin{cases} X, & X \leq \mathrm{threshold} \\ \infty, & \text{otherwise} \end{cases}$

The threshold function signifies the fact that P_(j) is related to P_(n) by a weighted edge only when P_(j) appears in the image links of P_(n) more than a certain number of times. If the frequency of P_(j) is very low, P_(j) is not considered to be relevant to P_(n). Associative and transitive relevance relationships are given as:

$(P_{n} \rightarrow_{\mathrm{Cost}(n,j)} P_{j}) \rightarrow_{\mathrm{Cost}(j,k)} P_{k} = P_{n} \rightarrow_{\mathrm{Cost}(n,j)} (P_{j} \rightarrow_{\mathrm{Cost}(j,k)} P_{k}),$

and: if $P_{n} \rightarrow_{\mathrm{Cost}(n,j)} P_{j}$ and $P_{j} \rightarrow_{\mathrm{Cost}(j,k)} P_{k}$, then $P_{n} \rightarrow_{\mathrm{Cost}(n,k)} P_{k}$,

where P_(n)→_(Cost(n,j))P_(j) denotes the relevance relationship from P_(n) to P_(j) with Cost(n,j), and Cost(n,k)=Cost(n,j)+Cost(j,k).
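A direct transcription of this cost function, with the threshold applied to the transformed value 1−S_n(j) as in the equations above:

    import math

    def edge_cost(s_n_j, threshold):
        # Cost(n, j) = Thr[1 - S_n(j)]; an infinite cost means that no
        # edge is placed between the two image vertices.
        x = 1.0 - s_n_j
        return x if x <= threshold else math.inf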

It would require many user studies using various sets of images to determine which of the symmetric and asymmetric relevance relationships is more effective. A relevance relationship can possibly be asymmetric, while a relevance graph is generally a directed graph. However, the present invention assumes a symmetric relationship, simply because it propagates image links further in a graph for a given number of user trials. The symmetry of relevance is represented by the symmetric cost function:

$\mathrm{Cost}(n, j) = \mathrm{Cost}(j, n) = \min[\, \mathrm{Cost}(n, j), \mathrm{Cost}(j, n) \,],$

and the commutative relevance relationship:

$P_{n} \rightarrow_{\mathrm{Cost}(n,j)} P_{j} = P_{j} \rightarrow_{\mathrm{Cost}(j,n)} P_{n}.$

The symmetry of the relevance relationship results in undirected graphs, as shown in FIG. 47. Specifically, FIG. 47 illustrates an undirected graph 4702 for a set of eight images and its adjacency matrix 4704.

Image Search

The present invention employs a relevance graph structure that relates PRIs in a way that facilitates graph-based image search and clustering. Once the image relevance is represented by a graph, one can use numerous well-established generic graph algorithms for image search. When a query image is given and it is a vertex in a relevance graph, it is possible to find the most relevant images by searching the graph for the lowest-cost image vertices from the source query vertex. A shortest-path algorithm such as Dijkstra's will assign a lowest cost to each vertex from the source, and the vertices can then be sorted by their costs from the query vertex. See Mark A. Weiss, “Algorithms, Data Structures, and Problem Solving with C++,” Addison-Wesley, Mass., 1995.
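A sketch of this graph search using Dijkstra's algorithm follows; the adjacency-dict representation is an assumption made for the sketch.

    import heapq

    def most_relevant_images(graph, query, max_results):
        # graph: {vertex: [(neighbor, edge_cost), ...]} built from the
        # relevance queues; infinite-cost pairs are simply omitted.
        dist = {query: 0.0}
        heap = [(0.0, query)]
        while heap:
            d, u = heapq.heappop(heap)
            if d > dist.get(u, float("inf")):
                continue                      # stale heap entry
            for v, cost in graph.get(u, []):
                nd = d + cost
                if nd < dist.get(v, float("inf")):
                    dist[v] = nd
                    heapq.heappush(heap, (nd, v))
        ranked = sorted((d, v) for v, d in dist.items() if v != query)
        return [v for _, v in ranked[:max_results]]

The returned vertices are the reachable images ordered by accumulated path cost from the query vertex, i.e., the most relevant images first.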

Hypershell Search

Generally, the first step of most image/video search algorithms is to extract a K-dimensional feature vector for each image/frame representing the salient characteristics to be matched. The search problem then translates into the minimization of a distance function d(o_(i),q) with respect to i, where q is the feature vector for the query image and o_(i) is the feature vector for the ith image/frame in the database. Further, it is known that search time can be reduced when the distance function d(.,.) has metric properties: 1) d(x,y)≧0; 2) d(x,y)=d(y,x); 3) d(x,y)≦d(x,z)+d(z,y) (the triangle inequality). Using the metric properties, particularly the triangle inequality, the hypershell search disclosed in the present invention also reduces the number of distance evaluations at query time, thus resulting in fast retrieval. Specifically, the hypershell algorithm uses the distances to a group of predefined distinguished points (hereafter called reference points) in a feature space to speed up the search.

To be more specific, the hypershell algorithm computes and stores in advance the distances to k reference points (d(o,p₁), . . . , d(o,p_(k))) for each feature vector o in the database of images/frames. Given the query image/frame q, its distances to the k reference points (d(q,p₁), . . . , d(q,p_(k))) are first computed. If, for some reference point p_(i), |d(q,p_(i))−d(o,p_(i))|>ε, then d(o,q)>ε holds by the triangle inequality, which means that the feature vector o is not close enough to the query q, so there is no need to explicitly evaluate d(o,q). This is one of the underlying ideas of the hypershell search algorithm.
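This pruning test is small enough to state directly in code; the function name and argument layout are illustrative.

    def can_prune(dist_q_to_refs, dist_o_to_refs, eps):
        # dist_q_to_refs[i] = d(q, p_i); dist_o_to_refs[i] = d(o, p_i),
        # the latter precomputed during indexing. If the difference
        # exceeds eps for any reference point, the triangle inequality
        # gives d(o, q) > eps, so d(o, q) need not be evaluated.
        return any(abs(dq - do) > eps
                   for dq, do in zip(dist_q_to_refs, dist_o_to_refs))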

Indexing (or Preprocessing)

To make videos searchable, the videos should be indexed. In other words, prior to searching the videos, a special data structure for the videos should be built in order to minimize the search cost at query time. The indexing process of the hypershell algorithm consists of several steps.

First, the indexer simply takes a video as input and sequentially scans the video frames to see if they can be representative frames (or key frames), subject to some predefined distortion measure. For each representative frame, the indexer extracts a low-level feature vector such as a color correlogram, color histogram, or color coherence vector. The feature vector should be selected to represent well the significant characteristics of the representative frame. The current exemplary embodiment of the indexer uses the color correlogram, which carries information on the spatial correlation of colors as well as the color distribution. See J. Huang, S. K. Kumar, M. Mitra, W. Zhu and R. Zabih, “Image indexing using color correlogram,” in Proc. IEEE on Computer Vision and Pattern Recognition, 1997.

Second, the indexer performs PCA (Principal Component Analysis) on the whole set of the feature vectors extracted in the previous step. The PCA method reduces the dimensions of the feature vectors, thereby representing the video more compactly and revealing the relationships between feature vectors to facilitate the search.

Third, given a metric distance such as the L₂ norm, LBG (Linde-Buzo-Gray) clustering is performed on the entire population of the dimension-reduced feature vectors. See Y. Linde, A. Buzo and R. Gray, “An algorithm for vector quantization design,” in IEEE Trans. on Communications, 28 (1), pp. 84-95, January, 1980. The clustering starts with a codebook of a single codevector (or cluster centroid) that is the average of all the feature vectors. The codevector is split into two, and the algorithm is run with these two codevectors. The two resulting codevectors are split again into four, and the same process is repeated until the desired number of codevectors is obtained. These cluster centroids are used as the reference points for the hypershell search method.
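A sketch of this splitting procedure, assuming NumPy arrays and a fixed number of Lloyd refinement iterations per split (both assumptions of the sketch, not of the cited algorithm):

    import numpy as np

    def lbg_codebook(vectors, num_codevectors, perturb=1e-3, iters=10):
        # vectors: (N, D) array of dimension-reduced feature vectors;
        # num_codevectors should be a power of two for clean splitting.
        codebook = vectors.mean(axis=0, keepdims=True)   # single centroid
        while len(codebook) < num_codevectors:
            # Split every codevector into a slightly perturbed pair.
            codebook = np.vstack([codebook * (1 + perturb),
                                  codebook * (1 - perturb)])
            for _ in range(iters):                       # Lloyd refinement
                d = np.linalg.norm(vectors[:, None, :] - codebook[None],
                                   axis=2)
                assign = d.argmin(axis=1)
                for k in range(len(codebook)):
                    members = vectors[assign == k]
                    if len(members):
                        codebook[k] = members.mean(axis=0)
        return codebook

The rows of the returned codebook are the cluster centroids used as the reference points.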

Finally, the indexer computes distance graphs for each reference point and each cluster. For a reference point p_(i) and a cluster C_(j), the distance graph G_(i,j)={(α,n)} is a data structure that stores a sequence of value pairs (α,n), where α is a distance from the reference point p_(i) to the feature vectors in the cluster C_(j) and n is the number of feature vectors at the distance α from p_(i). Therefore, if the number of reference points is k and the number of clusters is m, then mk distance graphs are computed and stored in a database.

The indexing data, such as the dimension-reduced feature vectors, cluster information, and distance graphs produced in the above steps, are fully exploited by the hypershell search algorithm to find the best matches to the query image in the database. FIG. 48 illustrates this indexing process.

FIG. 48 illustrates the system 4800 of the present invention for implementing the hypershell search. The system 4800 is composed generally of an indexing module 4802 and a query module 4804. The indexing module contains storage devices in a storage module 4806 for storing frame and vector data. Specifically, storage space is allocated for key frames 4808, dimension-reduced feature vectors 4810, clusters and related centroids 4812, and distance graphs 4816. The storage elements mentioned above can be combined onto a single storage device or dispersed over multiple storage devices such as a RAID array, storage area network, or multiple servers (not shown). In operation, the digital video 4836 is sent to a key frame module 4818, which extracts feature vector information from selected frames. The key frames and associated feature vectors are then forwarded to the PCA module 4820, which both stores the feature vector information into storage module 4810 and forwards the dimension-reduced feature vectors 4840 to the LBG clustering module 4822. The LBG clustering module 4822 stores the clusters and their associated centroids into the cluster storage module 4812 and forwards the clusters and their centroids to the compute module 4824. The compute module 4824 computes the distance graphs and stores them into the distance graph storage module 4816. The indexing module 4802 is typically a combination of hardware and software, although the indexing module is capable of being implemented solely in hardware or solely in software.

The information stored in the indexing module is available to the query module 4804 (i.e., the query module 4804 is operably connected to the indexing module 4802 through a data bus, network, or other communications mechanism). The query module 4804 is typically implemented in software, although it can be implemented in hardware or a combination of hardware and software. The query module 4804 receives a query 4834 (typically in the form of an address or vector) for image or frame information. The query is received by the find module 4826, which finds the one or more clusters nearest to the query vector. Next, in module 4828, the hypershell intersection (either basic, partitioned, and/or partitioned-dynamic) is performed. Next, in module 4830, all of the feature vectors that are within the intersected regions (found by module 4828) are ranked. Thereafter, the ranked results are displayed to the user via display module 4832.

Search Algorithm

The problem of proximity search is to find all the feature points whose distance from a query point q is less than a distance ε, where ε is a real number indicating the fidelity of the search results. See E. Chavez, J. Marroquin and G. Navarro, “Fixed queries array: a fast and economical data structure for proximity searching,” in Multimedia Tools and Applications, pp. 113-135, 2001. The hypershell search algorithm of the present invention provides one of the efficient solutions to the proximity search problem.

A two-dimensional feature vector space is assumed in FIG. 63 for simplicity. Assume further that there are two reference points p₁ and p₂ in the 2D feature space. Given a query point q, the hypershell search first computes the distances D_(i) (i=1,2) between the query point q and the reference points p_(i) (i=1,2) and then generates one hypershell for each reference point. Each hypershell, denoted by 6302 and 6304, is preferably 2ε in thickness and lies D_(i) (i=1,2) away from its center located at its corresponding reference point p_(i). The intersection of the two hypershells 6302 and 6304 leads to the two regions I₁ and I₂ indicated in bold lines in FIG. 63. As illustrated, the intersection region I₁ includes a circle S of radius ε centered at the query point q.

The feature points inside the circle S of FIG. 63 are those feature points similar to the query point q, up to the degree of ε, and thus are the desired results of a proximity search. The value of ε may be predetermined at the time of database buildup or determined dynamically by a user at the time of query. Since all the points in the circle are contained in the intersections I₁ and I₂, it is desirable to search only the intersections instead of the whole feature space, thus dramatically reducing the search space.

As illustrated in FIG. 63, there may be more than one region resulting from the hypershell intersection in a multidimensional feature space. For example, the two intersected regions I₁ and I₂ of the 2-D feature space are illustrated in FIG. 63. In such a case, however, it is possible that one or more of the intersected regions may be irrelevant to the search. For example, in FIG. 63, the region I₁ is highly pertinent to the query point q while the region I₂ is not. Thus, to improve search performance, the least relevant regions, such as I₂, should be eliminated. One way to achieve such elimination is to partition the original feature space into a certain number of smaller spaces (also called clusters) and to apply the hypershell intersection to the clusters, or segmented feature spaces. FIG. 64 illustrates clusters 6402, 6404, 6406, 6408, 6410, 6412, 6414, and 6416, whose boundaries are denoted by dotted lines. Collectively, the dotted lines may be referred to as a Voronoi diagram of the cluster centroids. Referring to FIGS. 63 and 64, of the intersections I₁ and I₂, only the region I₁ would be considered a relevant region because it resides inside the same cluster to which the query point q belongs.

Three Preferred Embodiments

In searching for information according to the present invention, one or more of three preferred methods may be employed. In one embodiment of the present invention, where clusters are not employed, a basic hypershell search algorithm may be used. In another embodiment of the present invention, where clusters obtained by using the LBG algorithm described above are employed to improve search times, a partitioned hypershell search algorithm or a partitioned-dynamic hypershell search algorithm may be used. The basic hypershell search algorithm is discussed below with reference to FIG. 65. The partitioned hypershell search algorithm and the partitioned-dynamic hypershell search algorithm are also discussed below, with reference to FIGS. 66 and 67, respectively. Regardless of the search algorithm employed, however, for a given query image/frame q and distortion ε, the search returns the set of images/frames O satisfying:

$O = \{\, o_{k} \mid d(o_{k}, q) \leq \varepsilon,\ o_{k} \in R \,\},$

where R is the image/video database and d(.,.) is a metric distance.

Basic Hypershell Search Algorithm

In the first preferred embodiment, the basic hypershell search algorithm computes:

$O = \{\, o_{k} \mid d(o_{k}, q) \leq \varepsilon,\ o_{k} \in I \,\}, \quad I = \bigcap_{j=1}^{J} I_{j}, \quad I_{j} = \{\, i_{k} \mid |\, d(i_{k}, p_{j}) - d(q, p_{j}) \,| \leq \varepsilon \,\},$

where the p_(j)'s are the predetermined reference points and J is the number of reference points. I_(j) denotes the hypershell that is 2ε wide and centered at the reference point p_(j), and I denotes the set of intersections obtained by intersecting all the hypershells I_(j). As illustrated in FIG. 65, three hypershells 6502, 6504, and 6506 are generated by the basic hypershell search algorithm upon running an image/frame query with a distortion ε. Further, the use of the hypershells 6502, 6504, and 6506 produces the intersection 6508, bounded by bold lines. As mentioned above, the feature vector points within the intersection 6508 include those points that would be retrieved in a proximity search. It is worth noting that, compared with the other two embodiments described afterward, the basic hypershell search algorithm tends to incur a considerable search cost, namely the time to intersect the hypershells, because the number of data (image/frame) points contained in the intersection is usually larger than for the other two methods.
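A sketch of this first embodiment follows, assuming each database entry is stored as a (vector, ref_dists) pair whose ref_dists[j] holds the precomputed distance d(o, p_j) from the indexing stage:

    def basic_hypershell_search(query_vec, database, ref_points, eps, dist):
        d_q = [dist(query_vec, p) for p in ref_points]   # d(q, p_j)
        hits = []
        for vector, ref_dists in database:
            # A vector lies in the intersection I of all hypershells iff
            # |d(o, p_j) - d(q, p_j)| <= eps for every reference point.
            if all(abs(rd - dq) <= eps for rd, dq in zip(ref_dists, d_q)):
                if dist(query_vec, vector) <= eps:       # exact final check
                    hits.append(vector)
        return hits

Only vectors that survive the hypershell test incur an exact distance evaluation, which is where the savings over a linear scan come from.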

Partitioned Hypershell Search Algorithm

In the second preferred embodiment, the partitioned hypershell search algorithm computes:

$O = \{\, o_{k} \mid d(o_{k}, q) \leq \varepsilon,\ o_{k} \in I \,\}, \quad I = \bigcap_{j=1}^{J} I_{j}, \quad I_{j} = \{\, i_{k} \mid |\, d(i_{k}, p_{j}) - d(q, p_{j}) \,| \leq \varepsilon,\ i_{k} \in C_{n} \,\},$

where C_(n) represents the cluster closest to the query image/frame q. Similarly to the first embodiment, I_(j) denotes the hypershell that is 2ε wide and centered at the reference point p_(j), and I denotes the set of intersections obtained by intersecting all the hypershells. In this case, however, only the portion of the hypershells surrounded by the boundary of cluster C_(n) expanded by ε, as shown in FIG. 66, is searched. Without the boundary expansion, a feature point o that is close enough to the query image q (i.e., d(o,q)≦ε) but resides in a neighboring cluster would not be included in the outcome of the proximity search. It is often the case that other cluster-based search algorithms do not guarantee search results with a given fidelity. The lines 6602, 6604, 6606, and 6608 indicate the original cluster boundaries; the dotted lines 6610 and 6612 indicate the original cluster boundaries expanded by the distortion ε; and the darkened region 6614 denotes the expanded cluster C_(n), which includes the expansion region 6616 over which the search is performed.

Similar to FIG. 65, FIG. 66 illustrates three hypershells 6618, 6620, and 6622 that were created upon running an image/frame query q given a distortion ε. After partitioning the region of the hypershells 6618, 6620, and 6622, as indicated by the cluster boundaries 6602, 6604, 6606, and 6608, the region 6614 can be selected as the most pertinent region for further consideration. For the region 6614, the intersecting region 6624 is identified and actually searched.

Partitioned-Dynamic Hypershell Search Algorithm

While the partitioned hypershell search algorithm is the fastest of the three algorithms, it also has a larger memory requirement than its alternatives. The extra storage is needed due to the boundary expansion. For instance, a feature (image/frame) point near a cluster boundary, i.e., the boundary lines 6702, 6704, 6706, and 6708 of FIG. 67, often turns out to be an element contained in multiple clusters. Therefore, as an alternative, the partitioned-dynamic hypershell search algorithm is a lighter version of the partitioned hypershell search algorithm with a smaller memory requirement but approximately the same search time. It computes:

$O = \{\, o_{k} \mid d(o_{k}, q) \leq \varepsilon,\ o_{k} \in I \,\}, \quad I = \bigcap_{j=1}^{J} I_{j}, \quad I_{j} = \{\, i_{k} \mid |\, d(i_{k}, p_{j}) - d(q, p_{j}) \,| \leq \varepsilon,\ i_{k} \in C \,\},$

$C = \bigcup \{\, C_{k} : d(C_{k}, q) \leq r + \varepsilon \,\}, \quad r = \min_{k} d(C_{k}, q),$

where d(C_(k),q) is the distance between a cluster center and the query point. I_(j) denotes the hypershell that is 2ε wide and centered at the reference point p_(j), and I denotes the set of intersections obtained by intersecting all the hypershells. The quantity r is the shortest of all the distances between the query point and the cluster centroids, and C is the union of the clusters whose centroids are within the distance r+ε of the query point.

Fast Codebook Search

Given an input vector Q, the codebook search problem is defined as selecting the particular codevector X_(i) in a codebook C such that

$\| Q - X_{i} \| < \| Q - X_{j} \| \quad \text{for } j = 1, 2, \ldots, N,\ j \neq i,$

where N denotes the size of the codebook C. The fast codebook search of the present invention is used to find the closest cluster for the hypershell search described previously.

Multi-Resolution Structure Based on Haar Transform

Let H(•) stand for the Haar transform. Suppose further that a vector X=(x₁,x₂, . . . ,x_(k))∈R^(k) has the transform X^(h)=H(X)=(x₁^(h),x₂^(h), . . . ,x_(k)^(h)), where k is a power of 2, for example 2^(m). Then a Haar-transform-based multi-resolution structure for the vector X is defined to be a sequence of vectors {X^(h,0),X^(h,1), . . . ,X^(h,n), . . . ,X^(h,m)}, where X^(h,n) is the nth-level vector of size 2^(n) and X^(h,m)=X^(h). The multi-resolution structure is built in the bottom-up direction, taking the vector X^(h)=X^(h,m) as the initial input and successively producing the (m−1), (m−2), . . . , n, . . . , 2, 1, 0-th level vectors in this order. Specifically, the nth-level vector is obtained from the (n+1)-th level vector by simple substitution:

$X^{(h,n)}[p] = X^{(h,n+1)}[p] \quad \text{for } p = 1, 2, \ldots, 2^{n},$

where X^(h,n)[p] denotes the pth coordinate of the vector X^(h,n).

FIG. 29 illustrates the use of the Haar transform in the present invention. Specifically, the original feature space 2902 contains the elements X⁰ 2904, X¹ 2906, X² 2908, and X³ 2910, as illustrated in FIG. 29. Upon the transformation 2930, the corresponding transform elements X^(h,0) 2914, X^(h,1) 2916, X^(h,2) 2918, and X^(h,3) 2920 appear in the Haar transform space 2912, corresponding to the elements X⁰ 2904, X¹ 2906, X² 2908, and X³ 2910, respectively.
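A sketch of the orthonormal Haar transform and the levelwise prefix that defines the multi-resolution structure (the normalization by √2 is chosen so that Property 1 below holds):

    import math

    def haar_transform(x):
        # Orthonormal Haar transform of a vector whose length is a
        # power of 2; orthonormality preserves the L2 distance.
        out = list(x)
        n = len(out)
        while n > 1:
            half = n // 2
            tmp = out[:n]
            for i in range(half):
                a, b = tmp[2 * i], tmp[2 * i + 1]
                out[i] = (a + b) / math.sqrt(2)          # coarse average
                out[half + i] = (a - b) / math.sqrt(2)   # detail coefficient
            n = half
        return out

    def level_vector(x_h, n):
        # n-th level vector: the first 2^n coefficients, matching
        # X^(h,n)[p] = X^(h,n+1)[p] for p = 1, ..., 2^n.
        return x_h[:2 ** n]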

Properties

Property 1:

Suppose Q=(q₁,q₂, . . . ,q_(k)), X=(x₁,x₂, . . . ,x_(k)), Q^(h)=H(Q)=(q₁^(h),q₂^(h), . . . ,q_(k)^(h)), and X^(h)=H(X)=(x₁^(h),x₂^(h), . . . ,x_(k)^(h)). Then the L₂ distance between Q and X is equal to the L₂ distance between Q^(h) and X^(h):

$\sqrt{\sum_{i=1}^{k} (q_{i} - x_{i})^{2}} = \sqrt{\sum_{i=1}^{k} (q_{i}^{h} - x_{i}^{h})^{2}}.$

Property 2:

Assume that D^(n)(Q^(h),X^(h)) denotes the L₂ distance between the two nth-level vectors Q^(h,n) and X^(h,n) in the Haar transform space. Then the following inequality holds:

$D^{m}(Q^{h}, X^{h}) \geq D^{m-1}(Q^{h}, X^{h}) \geq \cdots \geq D^{1}(Q^{h}, X^{h}) \geq D^{0}(Q^{h}, X^{h}).$

The following pseudocode provides a workable method for the codebook search:

Input:  Q           // query vector
        HaarCodeBk  // codebook of Haar-transformed codevectors
        CbSize      // size of the codebook
        VecSize     // dimension of a codevector
Output: NN          // index of the codevector nearest to Q

Algorithm:

    min_dist = INFINITY;
    Q_haar = HaarTrans(Q);                  // compute the Haar transform of Q
    for (i = 0; i < CbSize; i++) {
        for (length = 1; length <= VecSize; length = length * 2) {
            // Levelwise distance over the first 'length' coefficients;
            // by Property 2 it can only grow as 'length' increases.
            dist = LevelwiseL2Dist(Q_haar, HaarCodeBk[i], length);
            if (dist >= min_dist) {
                break;                      // prune: try the next codevector
            }
            if (length == VecSize) {
                min_dist = dist;            // full-resolution distance is smaller
                NN = i;
            }
        }
    }
    return NN;

Peer to Peer Searching

To the best of the present inventors' knowledge, most current P2P systems perform searches using only a string of keywords. However, it is well known that if the search for multimedia content is made with visual features as well as textual keywords, it can yield enhanced results. Furthermore, if the search engine is reinforced by the advantages of P2P computing, the scope of the results can be expanded to include a plurality of diverse resources on peers' local storage as well as Web pages. Additionally, the time dedicated to the search will be remarkably reduced due to the distributed and concurrent computing. Taking the best parts from the visual search engine and the P2P computing architecture, the present invention offers a seamless, optimized integration of both technologies.

The implementation of this method of the present invention rests on the following basic assumptions (the Gnutella model: a server-less, or pure, peer-to-peer model):

1. The network consists of nodes (i.e., peers) and connections between them.
2. The nodes have the same capability and responsibility. There is no central server node; each node functions as both a client and a server.
3. A node knows only its own neighbors.

The following is a scenario for finding image files according to an embodiment of the present invention (a sketch in code follows the list):

1. A new user (denoted NU) enters the P2P network.
2. NU broadcasts or multicasts a message called a ping to announce its presence.
3. Nodes that receive the ping send a pong back to NU to acknowledge that they have received the ping message.
4. NU keeps track of the nodes that sent those pong messages so that it retains a list of active nodes to which NU is able to connect.
5. When NU initiates a search request, it broadcasts or multicasts to the network a query message that contains visual features as well as a string of keywords.
6. A node (denoted SN) that receives the query message runs an image search engine over the image database on the node's local storage. If SN finds images that satisfy the search criteria, it responds to NU with a search result message that may contain SN's IP address and a list of found file sizes and names.
7. NU attempts to make a connection to the node SN using SN's IP address and downloads the image files.
8. If NU triggers another search request, go to step 5. Otherwise, NU terminates the connection and leaves the P2P network.
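A message-level sketch of this scenario follows; the class, the stubbed visual_search matcher, and the tuple-based messages are all assumptions made for illustration, with transport and file download omitted.

    from dataclasses import dataclass

    @dataclass
    class ImageFile:
        name: str
        size: int
        features: list          # low-level visual feature vector

    def visual_search(image_db, keywords, features):
        # Placeholder matcher: keyword filter only; a real engine would
        # also rank candidates by visual-feature distance.
        return [f for f in image_db if any(k in f.name for k in keywords)]

    class PeerNode:
        def __init__(self, address, image_db):
            self.address = address
            self.image_db = image_db
            self.active_peers = set()       # filled in from pong messages

        def on_ping(self, sender):          # steps 2-3: acknowledge presence
            return ("pong", self.address)

        def on_pong(self, sender):          # step 4: track active nodes
            self.active_peers.add(sender)

        def on_query(self, sender, keywords, features):   # steps 5-6
            hits = visual_search(self.image_db, keywords, features)
            if hits:
                return ("result", self.address,
                        [(f.name, f.size) for f in hits])
            return None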

FIG. 25 is a flowchart illustrating the method 2500 of the present invention. The method begins generally at step 2502. Thereafter, a new user (NU) enters the peer-to-peer (P2P) network in step 2504. The new user multicasts a “ping” (service request) signal to announce its presence in step 2506. The new user then waits to receive one or more “pong” (acknowledgement) signals from other users on the network, step 2508. The new user keeps track of the nodes that sent “pong” messages in order to retain a list of active nodes for subsequent connections, step 2510. The new user then initiates a search request by multicasting a query message to the network in step 2512. The source node (SN) 2524 receives the new user's search request and executes a “visual” search using the query parameters in the new user's query message, step 2526. The source node then routes the search results to the new user in step 2528. The new user receives the search result message, which contains the source node's IP address as well as a list of names and sizes of found files, step 2514. Thereafter, the new user makes a connection to the source node using the source node's IP address and downloads multimedia files in step 2516. A check is made to determine if the new user wants to make another search request in step 2518. If so, execution loops back to step 2512. Otherwise, the user leaves the P2P network in step 2520, and the program terminates in step 2522.

3. Editing

The present invention includes a method and system for editing video material that edits only the metadata of the input videos to create a new video, instead of actually editing the videos stored as computer files. The present invention can be applied not only to videos stored on CD-ROM, DVD, and hard disk, but also to streaming videos on a local area network (LAN) or a wide area network (WAN) such as the Internet. The present invention further includes a method of automatically generating edited metadata using the metadata of the input videos. The present invention can be used in a variety of systems related to video editing, browsing, and searching. This aspect of the present invention can also be used on stand-alone computers as well as on those connected to a LAN or a WAN such as the Internet.

In order for the present invention to achieve these goals, the metadata of an input video file to be edited contain a URL of the video file and segment identifiers that enable one to uniquely identify the metadata of a segment, such as the time information, title, keywords, annotations, and key frames of the segment. A virtually edited metafile contains metadata copied from specific segments of several input metafiles, or contains only the URIs (Uniform Resource Identifiers) of these segments. In the latter case, each URI consists of both a URL of the input metafile and an identifier of the segment within the metafile.

The significance and the practical application of the present invention are described in detail by reference to the illustrated figures.

FIG. 32 compares the conventional video editing concept 3200 with the concept of virtual editing in the present invention 3200′. In FIG. 32, it is assumed that the metadata used during the virtual editing is stored in a separate metafile. Referring to FIG. 32, the prior art method (FIG. 32(a)) merely sends the various video files 3202 to the video editor 3206, where a user edits the videos to produce an edited video 3208. In contrast, the method of the present invention, as illustrated in FIG. 32(b), utilizes the metafiles 3204 of the videos 3202 and edits the metafiles 3204 in the virtual video editor 3206′ to produce a metafile 3210 of a virtually edited video.

FIG. 33 is an example of the creation of a new video using the virtual editing of the present invention with the metafiles of three videos. Video 1 (3340) consists of four segments 3342, 3344, 3346, and 3348 that correspond to elements 1, 2, 3, and 4, respectively, in the metafile 3302 of video 3340. Segments 1 and 2 of metafile 3302 are grouped into segment 5; segments 3 and 4 are grouped into segment 6; and segments 5 and 6 are themselves grouped into segment 7 of metafile 3302. Similarly, video 2 (3350) has three segments 3352, 3354, and 3356, which correspond to elements a, b, and c, respectively, of metafile 3304. As with metafile 3302, metafile 3304 groups its elements in a hierarchical structure (a and b into d, and c and d into e). Video 3 (3360), meanwhile, has five elements 3362, 3364, 3366, 3368, and 3370 that correspond to elements A, B, C, D, and E, respectively, of metafile 3306, as illustrated in FIG. 33. As with the other two metafiles, metafile 3306 has its elements grouped in a hierarchical structure, namely, A, B, and C into F, and D and E into G, from which F and G are grouped into H, as illustrated in FIG. 33.

The virtually edited metadata 3308 is composed of segments 3310, 3316, 3322, and 3328, each of which has a segment identifier 3312, 3318, 3324, and 3330, respectively, indicating, for example, that segment 3310 is from segment 5 (3314) of metadata 3302, segment 3316 is from segment c (3320) of metadata 3304, and segments 3322 and 3328 are from segments A (3326) and C (3332) of metadata 3306, as shown in FIG. 33. In order to form a hierarchical structure with the above segments, two segments 3380 and 3382 are defined in metafile 3308, as shown in FIG. 33.

There are two kinds of segments within the metafile of the virtually edited video: a component segment, whose metadata has already been defined in the input video metafile, such as segments 3310, 3316, 3322, and 3328, and a composing segment, whose metadata is newly defined in the metafile of the edited video, such as segments 3380 and 3382. A composing segment can have other composing segments and/or component segments as its child nodes, while a component segment cannot have any child nodes. Virtual video editing is, essentially, the process of selecting and rearranging segments from the several input video metafiles; hence, the composing segments are defined in such a way as to form a desired hierarchical tree structure with the component segments chosen from the input metafiles.

FIG. 33 describes the process of generating the virtually edited metadata. Segment 5 (3314) of metafile 3302, a segment to be edited, is selected by browsing through metafile 3302. Composing segment 3382 is newly generated, and it receives the selected segment 5 (3314) as its child node through the generation of a new segment 3310 and the saving of an identifier of segment 5 (3314) into the new segment. Therefore, the new segment 3310 becomes a component segment within the hierarchical structure being edited. Segment c (3320), another segment to be edited, is selected by browsing through metafile 3304. In order to make the selected segment c (3320) a child of the segment 3382, a new segment 3316 is generated, and an identifier of the segment c (3320) is saved into the new segment. Suppose one then browses through metafile 3306 and wants to make the two non-consecutive segments A (3326) and C (3332) into a consecutive segment and to give some title to the new segment. The composing segment 3382 then receives another newly created composing segment 3380 as its child node, and the title is written into the metadata of the segment 3380. The segment 3380 receives the selected segments A (3326) and C (3332) as its children through the generation of two new segments 3322 and 3328 and the saving of identifiers of the segments A (3326) and C (3332) into the new segments, respectively. The new segments 3322 and 3328 thereby become component segments within the hierarchical structure being edited.

Eventually, the edited metadata of FIG. 33 must be transformed into video that is useful to the user. FIG. 34 illustrates the virtually edited metadata 3408 and its corresponding restructured video 3440. Specifically, segment 5 (3414) presents video segments 3442 and 3444. Similarly, segment c (3420) presents video segment 3446, and segments A (3426) and C (3432) present video segments 3448 and 3450, respectively.

When the metadata of a selected segment in an input metafile is copied to a component segment in a virtually edited metafile, the copy operation can be performed in one of the two ways described below. First, all the metadata belonging to the selected segment of an input metafile are copied to a component segment within the hierarchical structure being edited. This method is quite simple. Moreover, a user can freely modify or customize the copied metadata without affecting the input metafile. Second, only the URI of the selected segment of an input metafile is recorded into the component segment within the hierarchical structure being edited. Since the URI is composed of a URL of the input metafile and an identifier of the selected segment within the file, the segment within the input metafile can be accessed from a virtually edited metafile if the URI is given. With this method, a user cannot customize the metadata of the selected segment; users can only reference it as it is. Also, if the metadata of a referenced segment is modified, the virtually edited metafile referencing the segment will reflect the modification accordingly, regardless of the user's intention.

In both methods, for the playback of the virtually edited metafile, the URL of the input video file containing the copied or referenced segment has to be stored in the corresponding component segment. In a virtually edited metafile generated with the first method, if the video URLs of all the sibling nodes belonging to a composing segment are equal, the URL of the video file is stored in the composing segment having these nodes as children, and the URL of the video file is removed from the metadata of these nodes. This step guarantees that, if the metadata of a composing segment has the URL of a video file, all the segments belonging to the composing segment come from the same video file. When making a play list for the playback of a composing segment, an efficient algorithm can be achieved using this characteristic. That is, when inspecting a composing segment in order to make its play list, the inspection can stop, without inspecting all the segment's descendants, if the segment has a URL of a video file.
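A sketch of this URL-hoisting step over a simple nested-dict representation of the composing buffer (the 'url' and 'children' key names are assumptions of the sketch):

    def hoist_video_urls(segment):
        # Bottom-up pass: first normalize the children, then move a
        # common video URL from the children to their parent segment.
        children = segment.get("children", [])
        for child in children:
            hoist_video_urls(child)
        urls = [c.get("url") for c in children]
        if urls and all(u is not None and u == urls[0] for u in urls):
            segment["url"] = urls[0]
            for c in children:
                del c["url"]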

FIG. 35 is a flowchart of the method of the present invention for virtual video editing based on metadata. The present invention can only be applied where content-based, hierarchically structured metadata of the video exist, either within the metafile itself or in a database management system (DBMS). In the flowchart of FIG. 35, it is assumed that the metadata exist in the form of a metafile. Even if the metadata are stored in a DBMS, the method of the present invention can be applied if each segment can be uniquely identified by some type of key or identifier of a database object.

A detailed description of the method depicted in FIG. 35 is as follows. The method begins generally at step 3502, where a metafile of an input video is loaded. Next, in step 3504, one or more segments are selected while browsing through the metafile. A check is made in step 3506 to determine if a composing segment should be created. If so, step 3508 is performed, where the composing segment is created in the hierarchical structure being edited within the composing buffer. Thereafter, or if the result of step 3506 is negative, step 3510 is performed, where a composing segment is specified from the newly created or pre-existing ones, and a component segment is created as a child node of the specified composing segment. Next, in step 3512, a check is made to determine whether a copy of the metadata is to be used or a URI is to be used in its place. If a copy of the segment is used, then step 3516 is performed, where the metadata of the selected segment is copied to the newly created component segment. If the URI is to be used, then step 3514 is executed, where the URI of the selected segment is copied to the component segment. In either case, step 3518 is performed next, where the URL of the input video file is written to the component segment. Next, a check is made at step 3520 to determine if the URLs of all of the sibling nodes are identical. If so, step 3522 is performed, where the URL is written to the parent composing segment and the URLs of all of the child segments are deleted. Thereafter, in step 3524, a check is made to determine if another segment is to be selected. If so, execution loops back to step 3504. Otherwise, a check is made at step 3526 to determine if another metafile is to be input to the process. If so, then execution loops all the way back to step 3502. Otherwise, a virtually edited metafile is generated from the composing buffer in step 3528, and the method ends.

FIGS. 36, 37, 38, 39, and 40 describe the preferred application of the present invention. Video 1 and its metafile, along with video 2 and its metafile (see FIG. 33), are stored in a computer with the domain name www.video.server1 as inputs. Video 3 and its metafile (see FIG. 33) are stored in www.video.server2. FIG. 36 is a description of the metafile for video 1 (see FIG. 33) using the extensible markup language (XML), the universal format for structured documents. The metafile of video 1 contains the URL of video 1, and every pre-defined segment contains several metadata items, including the time information of the segment. Each pre-defined segment also has its own segment identifier to uniquely distinguish it within a file. Video 2 and video 3 of FIG. 33 are described in XML in the same way in FIG. 37 and FIG. 38, respectively.

FIGS. 39 and 40 are representations of the metafile in XML after virtually editing video 1, video 2, and video 3. Assume that the metafile is stored in www.video.server2. As indicated in FIG. 35, there are two ways of copying the metadata of an input metafile's selected segment to a component segment of a virtually edited metafile. FIG. 39 was composed by the first method, which is to copy all the metadata within a selected segment to the component segment. FIG. 40 was composed by the second method, which is to store the URI of the selected segment in the component segment. In FIG. 40, the URI is composed of the input metafile's URL and the segment identifier within the file, according to the xlink and xpointer specifications. The “#” between the URL and the segment identifier indicates that the URI is composed of a URL and a segment identifier within XML. The id() function, which takes the segment identifier as its parameter, indicates that the segment is uniquely identifiable by that identifier.

To play a specific segment of the virtually edited metafile, a play list of the actual videos within the segment has to be created. The play list sequentially lists the URLs of the videos contained in the selected segment as well as the time information (for example, the starting frame number and duration). When the virtual video player receives the play list, it plays the segments arranged in the play list sequentially. FIG. 41 is a representation, in XML, of the play list of the root segment of FIG. 39 and FIG. 40.
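Under the same nested-dict representation used in the sketch above, play-list generation can exploit the URL-hoisting optimization by stopping its descent at any segment that carries both a video URL and time information (again an assumption about how the segment records its times):

    def make_play_list(segment):
        # A segment with its own URL and time span covers all of its
        # descendants, so the walk need not descend any further.
        if "url" in segment and "start" in segment:
            return [(segment["url"], segment["start"], segment["duration"])]
        entries = []
        for child in segment.get("children", []):
            entries.extend(make_play_list(child))
        return entries

The resulting list of (URL, starting frame, duration) triples is handed to the virtual video player, which plays the entries sequentially.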

FIG. 42 is a block diagram of a virtual video editor supporting virtual video editing. In FIG. 42, the dotted lines represent the flow of data files, the solid lines the flow of metadata, and the bold solid lines the flow of control signals. The major components of the virtual video editor are as follows.

The input video files (4208, 4210, 4214) and their metafiles (4204, 4206, 4212) reside in the local computer or in computers connected by a network. In FIG. 42, video 1 (4208) and video 2 (4210) reside in the local computer and video 3 (4214) in a computer connected by the network. Therefore, when a video file and metafile are in a computer connected by the network, the video file and metafile are transferred to the virtual video editor 4202 through the network. This process is handled by the file controller 4222 and the network controller 4220. In other words, after the video and metafile are transferred from the network controller 4220 to the user, the file controller 4222 reads the video file as well as the metafile in the local computer, or the video file and the metafile transferred over the network. The metafile read by the file controller is transferred to the XML parser 4224. After the XML parser validates whether the transferred metadata are well-formed according to XML syntax, the metadata are stored in the input buffer 4226. In this case, the metadata stored in the input buffer have the hierarchical structure described in the input metafile.

A user performs virtual video editing with the structure manager 4228. First, by browsing and playing segments of the input buffer through the display device 4240 using the video player 4238, the user selects a video segment to be copied. The copying of the metadata of the selected segment to the composing buffer is done by the structure manager 4228. That is, all operations related to the creation of the edited hierarchical structure, as well as the management done within the input buffer, such as selecting a particular composing segment, constructing a new composing segment or component segment, and copying the metadata, are performed by the structure manager.

For example, assume that segment c (3320) of video 2 (3304) (see FIG. 33) is selected by the editor. The URL of video 2 is www.video.server1/video2, and the URI of segment c 3320 in the metafile is www.video.server1/metafile2.xml#id(seg_c). Referring to FIG. 37, the metadata of segment ‘seg_c’ of video 2 is as follows:

    <Segment id="seg_c" title="segment c" duration="150">
      <StartTime>230</StartTime>
      <MediaDuration>150</MediaDuration>
      <Keyframe>...</Keyframe>
      <Annotation>...</Annotation>
      ...
    </Segment>

There are two methods of copying the metadata to the component segment of a composing buffer, as described in FIG. 35. First, the selected metadata itself is copied to the component segment generated at the composing buffer (see FIG. 39):

    <Segment id="seg_c" title="segment c" duration="150">
      <MediaURI>//www.video.server1/video2</MediaURI>
      <StartTime>230</StartTime>
      <MediaDuration>150</MediaDuration>
      <Keyframe>...</Keyframe>
      <Annotation>...</Annotation>
      ...
    </Segment>

Second, only the URI of the selected segment is copied to the component segment generated at the composing buffer (see FIG. 40):

    <Segment xlink:form="simple" show="embed"
             href="//www.video.server1/metafile2.xml#id(seg_c)">
      <MediaURI>//www.video.server1/video2</MediaURI>
    </Segment>

To indicate which input video is related to the copied metadata, the metadata of the newly created component segment contains the URL to the relevant videos of the segment.
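
The two copy methods can be sketched in a few lines (a minimal illustration in Python; the dictionary keys are stand-ins for the XML elements above, not the exact schema):

    import copy

    def copy_by_value(selected_segment):
        # Method 1 (FIG. 39): duplicate all metadata of the selected
        # segment into the new component segment.
        return copy.deepcopy(selected_segment)

    def copy_by_reference(metafile_url, segment_id, media_uri):
        # Method 2 (FIG. 40): store only a URI that points back to the
        # selected segment of the input metafile.
        return {
            "href": "//%s#id(%s)" % (metafile_url, segment_id),
            "MediaURI": media_uri,
        }

Method 1 makes the edited metafile self-contained, while method 2 keeps it small and always consistent with the input metafile, at the cost of dereferencing the URI at play time.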

A play list generator 4236 is used to play segments in the hierarchical structure of the input buffer or composing buffer. Using the metafile's URL and the time information obtained from the metadata, the play list generator passes a play list, such as that of FIG. 41, to the video player 4238. The video player plays the segments defined in the play list sequentially. The video being played is shown on the display device 4240. When the editing is done, the hierarchical structure edited in the composing buffer is saved as metafile 4242 by the XML generator 4234.

4. Transcoding

4.1 Perceptual Hint for Image Transcoding

4.1.1 Spatial Resolution Reduction Value

The present invention also provides a novel scheme for transcoding an image to fit the size of the respective client display when an image is transmitted to a variety of client devices with different display sizes. First, the method of perceptual hints for each image block is introduced, and then an image transcoding algorithm is presented, as well as an embodiment in the form of a system that incorporates the algorithm to produce the desired result. The perceptual hint provides information on the minimum allowable spatial resolution reduction for a given semantically important block in an image. The image transcoding algorithm selects the best image representation to meet the client capabilities while delivering the largest content value. The content value is defined as a quantitative measure of the information on importance and spatial resolution for the transcoded version of an image.

A spatial resolution reduction (SRR) value is determined by either the author or publisher, or by an image analysis algorithm, and can also be updated after each user interaction. SRR specifies a scale factor for the maximum spatial resolution reduction of each semantically important block within an image. A block is defined as a spatial segment/region within an image that often corresponds to the area of an image that depicts a semantic object such as a car, bridge, face, and so forth. The SRR value represents the minimum allowable spatial resolution, namely, width and height in pixels, of each block at which users can still perceptually recognize the block's content according to the author's expectation. The SRR value for each block can be used as a threshold that determines whether the block is to be sub-sampled or dropped when the block is transcoded.

Consider n blocks of users' interest within an image I_A. If one denotes the ith block as B_i, so that I_A = {B_i}, i = 1, . . . , n, then the SRR value r_i of B_i is modeled as

    $r_i \equiv \frac{r_i^{\min}}{r_i^{o}}$,

where r_i^min is the minimum spatial resolution that a human can perceive and r_i^o is the original spatial resolution of B_i. For simplicity, the spatial resolution is defined as the length in pixels of either the width or the height of a block.

The SRR value ranges from 0 to 1, where 0.5 indicates that the resolution can be reduced by half and 1 indicates that the resolution cannot be reduced. For a 100×100 block whose SRR value is 0.7, for example, the author of the block of information indicates that the resolution of the block may be reduced down to 70×70 (thus, the minimum allowable resolution) without degrading perceptibility for users. This value can then be used to determine the acceptable boundaries of resolutions that can be viewed by a given device over the system of the present invention illustrated in FIG. 53.

The SRR value also provides a quantitative measure of how much the important blocks in an image can be compressed to reduce the overall data size of the compressed image while preserving the image fidelity that the author intended.

4.1.2 Transcoding Hint for Each Image Block

The SRR value is best used together with the importance value of J. R. Smith, R. Mohan, and C.-S. Li, “Content-based Transcoding of Images in the Internet,” in Proc. IEEE Intern. Conf. on Image Processing, October 1998; and S. Paek and J. R. Smith, “Detecting Image Purpose in World-Wide Web Documents,” in Proc. SPIE/IS&T Photonics West, Document Recognition, January 1998. Both the SRR value (r_i) and the importance value (s_i) are associated with each B_i. Thus:

    I_A = {B_i} = {(r_i, s_i)}, i = 1, . . . , n.

4.1.3 Image Transcoding Algorithm Based on Perceptual Hint

4.1.3.1 Content Value Function V

Image transcoding can be viewed in a sense as adapting the content to meet resource constraints. Rakesh Mohan et al. modeled the content adaptation process as resource allocation in a generalized rate-distortion framework. See, e.g., R. Mohan, J. R. Smith and C.-S. Li, “Multimedia Content Customization for Universal Access,” in Multimedia Storage and Archiving Systems, Boston, Mass.: SPIE, Vol. 3527, November 1998; R. Mohan, J. R. Smith and C.-S. Li, “Adapting Multimedia Internet Content for Universal Access,” IEEE Trans. on Multimedia, Vol. 1, No. 1, pp. 104-14, March 1999; and R. Mohan, J. R. Smith and C.-S. Li, “Adapting Content to Content Resources in the Internet,” in Proc. IEEE Intern. Conf. on Multimedia Comp. and Systems ICMCS99, Florence, June 1999. This framework is built on Shannon's rate-distortion (R-D) theory, which determines the minimum bit-rate R needed to represent a source with desired distortion D, or alternately, given a bit-rate R, the distortion D in the compressed version of the source. See C. E. Shannon, “A Mathematical Theory of Communication,” Bell Syst. Tech. J., Vol. 27, pp. 379-423, 1948. They generalized rate-distortion theory to a value-resource framework by considering different versions of a content item in an InfoPyramid as analogous to compressions, and different client resources as analogous to bit-rates, respectively. However, the value-resource framework does not provide quantitative information on the allowable factor by which blocks can be compressed while preserving the minimum fidelity that an author or a publisher intended. In other words, it does not provide a quantified measure of perceptibility indicating the degree of allowable transcoding. For example, it is difficult to measure the loss of perceptibility when an image is transcoded to a set of cropped and/or scaled versions.

To overcome this problem, an objective measure of fidelity that models the human perceptual system, called the content value function V, is introduced in the present invention for any transcoding configuration C:

    C = {I, r},

where I ⊆ {1, 2, . . . , n} is the set of indices of the blocks to be contained in the transcoded image and r is the SRR factor of the transcoded image. The content value function V can be defined as:

    $V = V(I, r) = \sum_{i \in I} V_i(r) = \sum_{i \in I} s_i \cdot u(r - r_i)$, where $u(x) = \begin{cases} 1, & \text{if } x \geq 0 \\ 0, & \text{elsewhere.} \end{cases}$

The above definition of V now provides a measure of fidelity that is applicable to the transcoding of an image at different resolutions and different sub-image modalities. In other words, V defines a quantitative measure of how much importance and perceptual information the transcoded version of an image retains. V takes a value from 0 to 1, where 1 indicates that all of the important blocks are perceptible in the transcoded version of the image and 0 indicates that none are. The value function is assumed to have the following property:

Property 1: The value V is monotonically increasing in r and I. Thus:

1.1 For a fixed I, V(I, r₁)<V(I, r₂) if r₁<r₂,

1.2 For a fixed r, V(I₁,r)≦V(I₂, r) if I₁⊂I₂.

4.1.4 Content Adaptation Algorithm

Denoting the width and height of the client display by W and H, respectively, the content adaptation is modeled as the following resource allocation problem:

    maximize $V(I, r)$ such that $r\,(x_u - x_l) \leq W$ and $r\,(y_u - y_l) \leq H$,

where the transcoded image is represented by a rectangular bounding box whose lower and upper bound points are (x_l, y_l) and (x_u, y_u), respectively.

Lemma 1: For any I, the maximum resolution factor is given by

    $r_{\max}^{I} = \min_{i, j \in I} r_{i,j}$, where $r_{i,j} = \min\left( \frac{W}{x_i - x_j}, \frac{H}{y_i - y_j} \right)$.

Lemma 1 says that only those configurations C={I, r} with r≦r_(max)^(I) are feasible. Combined with Property 1.1, this implies that for a given I, the maximum value is attained when C={I, r_(max)^(I)}. Therefore, other feasible configurations C={I, r}, r<r_(max)^(I), need not be searched. At this point, one has a naïve algorithm for finding an optimal solution: for all possible I ⊆ {1, 2, . . . , n}, calculate r^(I)_(max) by the maximal resolution factor (above) and then V(I, r^(I)_(max)) by the content value function defined in subsection 4.1.3.1, to find an optimal configuration C_(opt).

The algorithm can be realized by considering a graph R=[r_(ij)], 1≦i,j≦n, and noting that an I corresponds to a complete subgraph (clique) of R, and that r^(I)_(max) is then the minimum edge or node value in I.

Assume I to be a clique of degree K (K≧2). It is easily shown that among the subcliques S of I, there are at least 2^(K−2) cliques whose r^(S)_(max) is equal to r^(I)_(max), which, according to Property 1.2, need not be examined to find the maximum value of V. Therefore, only maximal cliques are searched. Initially, r is set to r^(R)_(max) so that all of the blocks can be contained in the transcoded image. Then r is increased discretely, and for the given r only the maximal cliques are examined. A minimum heap H is maintained in order to store and track maximal cliques with r_(max) as the sorting criterion. The following pseudo-code illustrates finding the optimal configuration:

    Enqueue R into H
    WHILE H is not empty
      I is dequeued from H
      Calculate V(I, r^(I)_(max))
      Enqueue maximal cliques inducible from I after removing the
        critical (minimum) edge or node
    END_WHILE
    Print optimal configuration that maximizes V
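
For concreteness, the naïve search described above can be sketched as follows (Python; the Block fields are illustrative only, and the clamp to r ≦ 1 reflects the convention that resolution is never increased). This brute-force version enumerates the subsets I directly rather than maintaining the heap of maximal cliques:

    from itertools import combinations
    from typing import NamedTuple

    class Block(NamedTuple):
        xl: float  # bounding-box lower x
        yl: float  # bounding-box lower y
        xu: float  # bounding-box upper x
        yu: float  # bounding-box upper y
        s: float   # importance value s_i
        r: float   # SRR value r_i (minimum allowable scale factor)

    def max_resolution_factor(blocks, W, H):
        # r_max^I: the largest uniform scale factor at which the bounding
        # box enclosing all selected blocks still fits a W x H display.
        xl = min(b.xl for b in blocks)
        xu = max(b.xu for b in blocks)
        yl = min(b.yl for b in blocks)
        yu = max(b.yu for b in blocks)
        return min(W / (xu - xl), H / (yu - yl))

    def content_value(blocks, r):
        # V(I, r): sum of s_i over blocks whose SRR threshold r_i is met.
        return sum(b.s for b in blocks if r >= b.r)

    def naive_optimal_config(blocks, W, H):
        best_v, best_idx, best_r = 0.0, (), 0.0
        for k in range(1, len(blocks) + 1):
            for idx in combinations(range(len(blocks)), k):
                sel = [blocks[i] for i in idx]
                r = min(1.0, max_resolution_factor(sel, W, H))
                v = content_value(sel, r)
                if v > best_v:
                    best_v, best_idx, best_r = v, idx, r
        return best_v, best_idx, best_r  # V, selected indices, scale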

FIGS. 43 and 44 demonstrate the results of transcoding according to the method of the present invention. Specifically, FIG. 43 illustrates a comparison 4300 of a non-transformed resolution reduction scheme 4302 to the transcoded scheme 4304 of the present invention. Underneath each example is a content value parameter indicative of the “value” seen by the user. As shown in FIG. 43, the images for workstations 4306 and 4316 are identical in content value (1.0). When moved to a color PC with a smaller screen, the entire image is merely shrunk proportionally, and the content value for images 4308 and 4318 remains 1.0. A small television, however, has a still smaller screen. The prior art method shrinks the image 4310 yet again, bringing the resolution detail, and thus the content value, to 0, while the transcoding method of the present invention preserves the resolution of the areas of interest 4330 in the image 4320 while removing (cropping) relatively extraneous information, and thus commands a higher content value of 0.53. This same result is illustrated for images 4312 and 4314 for the HHC and PDA of the prior art method, and for images 4322 and 4324 for the respective examples employing the method of the present invention. It should be noted that the designation of the area(s) of interest 4330 can be specified by the author or an image analysis algorithm, or it may be identified by adaptive techniques through user feedback as explained elsewhere within this disclosure.

Similarly, FIG. 44 illustrates a comparison 4400 of a non-transformed resolution reduction scheme 4402 to the transcoded scheme 4404 of the present invention. Underneath each example is a content value parameter indicative of the “value” seen by the user. As shown in FIG. 44, the images for workstations 4406 and 4416 are identical in content value (1.0). When moved to a color PC with a smaller screen, the entire image is merely shrunk proportionally, and the content value for images 4408 and 4418 remains 1.0. A small television, however, has a still smaller screen. The prior art method shrinks the image 4410 yet again, bringing the resolution detail, and thus the content value, to 0, while the transcoding method of the present invention preserves the resolution of the area of interest 4430 in the image 4420 while removing (cropping) relatively extraneous information, and thus commands a higher content value of 1.0. This same result is illustrated for images 4412 and 4414 for the HHC and PDA of the prior art method, and for images 4422 and 4424 for the respective examples employing the method of the present invention.

As described above, this disclosure has provided a novel scheme for transcoding an image to fit the size of the respective client display when an image is transmitted to a variety of client devices with different display sizes. First the notion of a perceptual hint for each image block is introduced, and then an optimal image transcoding algorithm is presented.

4.2 Video Transcoding Scheme

The method of the present invention further provides a scheme to transcode video for a variety of client devices having different display sizes. A general overview of the scheme is illustrated in FIG. 45. Generally, the content transcoder 4502 contains various modules that take data from a content database 4504, modify the content, and forward the modified content to the Internet for viewing by various devices. More specifically, the system 4500 has a content database 4504 that maintains content information as well as (optionally) publisher and author preferences. Upon a request, either from the Internet or from a client device such as television 4516 (or another transmitting device), a signal is received by the policy engine 4506 that resides within the content transcoder 4502. The policy engine 4506 is operative with the content database 4504 and can receive policy information from the database 4504, as illustrated in FIG. 45. Content information is retrieved from the database 4504 by the content analyzer 4508, which then forwards the content to the content selection module 4510, which is also operative with the policy engine 4506. Based upon policy and information from the content analysis and manipulation library 4512, specific content is selected and forwarded to the content manipulation module 4514, which modifies the content for viewing by the specific requesting device. It should be noted that the content analysis and manipulation library 4512 is operative with most of the main modules, specifically the content analyzer 4508 as well as the content selection module 4510 and the content manipulation module 4514. Typically, the output information from the content transcoder is forwarded to the Internet for eventual receipt and display on, for example, a personal computer 4524 for the enjoyment of user 4526, a personal data appliance 4522, a laptop 4520, a mobile telephone 4518, or a television 4516.

The policy engine module 4506 gathers the capabilities of the client, the network conditions, and the transcoding preferences of the user as well as of the publisher and/or author. This information is used to define the transcoding options for the client. The system then selects the output versions of the content and uses a library of content analysis and manipulation routines to generate the optimal content to be delivered to the client device.

The content analyzer 4508 analyzes the video, namely the scenes of video frames, to find their type and purpose, the motion vector direction, face/text regions, and so forth. Based on this information, the content selection module 4510 and the manipulation module 4514 transcode the video by adaptively selecting the attention area, defined by a position and size of a rectangular window, for example, in a video intended to fit the size of the respective client display. The system 4500 will select a dynamically transcoded (for example, scaled and/or cropped) area in the video without degrading perceptibility for users. This system also has a manual editing routine that lets the publisher and author manually alter/adjust the position and size of the transcoded area.

FIG. 46 illustrates an example of a focus-of-attention area 4604 within a video frame 4602, defined by an adaptive rectangular window in the figure. The adaptive window is represented by its position and size as well as by its spatial resolution (width and height in pixels). Given an input video, a simplified transcoding process can be summarized as:

1. Perform a scene analysis within the entire frame or certain slices of the frame;

2. Determine the window size and position and adjust accordingly; and

3. Transcode the video according to the determined window.

Given the display size of the client device, the scene (or content) analysis adaptively determines the window position as well as the spatial resolution for each frame/clip of the video. Information on the gradient of the edges in the image can be used to intelligently determine the minimum allowable spatial resolution given the window position and size. The video is then fast-transcoded by performing the cropping and scaling operations in the compressed domain, such as the DCT domain in the case of MPEG-1/2.

The present invention also enables the author or publisher to dictate the default window size. That size represents the maximum spatial resolution of the area that users can perceptually recognize according to the author's expectation. Furthermore, the default window position is defined as the central point of the frame. For example, one can assume that this default window is to contain the central 64% of the frame area, obtained by eliminating 10% of background from each of the four edges, assuming no resolution reduction. The default window can be varied or updated after the scene analysis. The content/scene analyzer module analyzes the video frames to adaptively track the attention area. The following are heuristic examples of how to identify the attention area; they cover frame scene types (e.g., background, synthetic graphics, complex, etc.) that can help to adjust the window position and size.
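
Before turning to those heuristics, the default-window computation itself can be illustrated with a small sketch (assuming the 10% margin of the example above; the function is illustrative, not part of the specification):

    def default_window(frame_w, frame_h, margin=0.10):
        # Trim `margin` of the frame from each of the four edges; the
        # default 10% keeps the central 0.8 x 0.8 = 64% of the area,
        # centered on the frame's central point.
        x0 = int(frame_w * margin)
        y0 = int(frame_h * margin)
        x1 = int(frame_w * (1.0 - margin))
        y1 = int(frame_h * (1.0 - margin))
        return x0, y0, x1, y1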

4.2.1 Landscape or Background

Computers have difficulty finding perceptually outstanding objects. But certain types of objects can be identified by text and face detection or object segmentation. Where objects are defined as spatial region(s) within a frame, they may correspond to regions that depict different semantic objects such as cars, bridges, faces, embedded text, and so forth. For example, in the case that there exist no objects (especially faces and text) larger than a specific threshold value within the frame, one can define this specific frame as landscape or background. One may then use the default window size and position.

4.2.2 Synthetic Graphics

One may also adjust the window to display the whole text. The text detection algorithm can determine the window size.

4.2.3 Complex

In the case where there exist recognized (synthetic or natural) objects whose size is larger than a specific threshold value within the frame, one may initially select the most important object among the objects and include this object in the window. The factors that have been found to influence visual attention include the contrast, shape, size, and location of the objects. For example, the importance of an object can be measured as follows:

1. Important objects are in general in high contrast with their background;

2. The bigger the size of an object is, the more important it is;

3. A thin object has high shape importance, while a rounder object has lower shape importance; and

4. The importance of an object is inversely proportional to the distance from the center of the object to the center of the frame.

At a highly semantic level, the criteria for adjusting the window include, for example:

1. A frame with text at the bottom, such as in news; and

2. A frame/scene where two people are talking to each other. For example, person A is on the left side of the frame and the other person is on the right side. Given the size of the adaptive window, one cannot include both people unless the resolution is reduced further. In this case, one has to include only one person.

5. Visual Rhythm

The visual rhythm of a video is a single image, that is, a two-dimensional abstraction of the entire three-dimensional content of the video, constructed by sampling a certain group of pixels of each image in the sequence and temporally accumulating the samples along time. Each vertical line in the visual rhythm of a video consists of a small number of pixels sampled from a corresponding frame of the video according to a specific sampling strategy. FIG. 26 shows several different sampling strategies 2600, such as horizontal sampling 2603, vertical sampling 2605, and diagonal sampling 2607. For example, the diagonal sampling strategy 2607 samples pixels regularly from those lying along a diagonal line of each frame of the video. The sampling strategies illustrated in FIG. 26 are only a partial list of all realizable sampling strategies for visual rhythm, which is utilized for many useful applications such as shot detection and caption text detection.

The sampling strategy must be carefully chosen so that the constructed visual rhythm retains the edit effects that characterize shot changes. Diagonal sampling provides the best visual features for distinguishing the various video editing effects on the visual rhythm. All visual rhythms presented hereafter are assumed to be constructed using the diagonal sampling strategy for shot detection, but the present invention can easily be applied to any sampling strategy.
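
As a rough sketch (assuming fully decoded grayscale frames held as NumPy arrays; the compressed-domain shortcut that avoids full decoding is described below), diagonal construction looks like this:

    import numpy as np

    def diagonal_visual_rhythm(frames):
        # frames: iterable of H x W grayscale arrays, one per video frame.
        # Each frame contributes one column: its main-diagonal pixels.
        cols = []
        for f in frames:
            h, w = f.shape
            z = np.arange(h)
            cols.append(f[z, z * w // h])  # walk the main diagonal
        return np.stack(cols, axis=1)      # shape: (H, number of frames)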

The construction of a visual rhythm is, however, a very time-consuming process using conventional video decoders for digital video, because such decoders are designed to decode all pixels composing a frame while the visual rhythm requires only a few pixels of a frame. Therefore, one needs an efficient method to construct the visual rhythm as fast as possible from compressed video. Such a method greatly reduces the time of the shot detection process, as well as that of the caption text detection process or any other application derived from it.

In video terminology, a compression method that employs only spatial redundancy is referred to as intraframe coding, and frames coded in such a way are defined as intra-coded frames. Most video coders adopt block-based coding either in the spatial or transform domain for intraframe coding to reduce spatial redundancy. For example, MPEG adopts the discrete cosine transform (DCT) of 8×8 blocks into which 64 neighboring pixels are exclusively grouped. Therefore, whatever compression scheme (DCT, discrete wavelet transform, vector quantization, etc.) is adopted for a given block, one need only decompress a small number of blocks in an intra-coded frame, instead of decoding all the blocks composing the frame, when only a few pixels out of the whole are needed. The same applies to lossy JPEG compression of individual images. In order to achieve optimum compression, most video coders also use a method that exploits the temporal redundancy between frames, referred to as interframe coding (predictive, interpolative), by tracking the N×M block in the reference picture that best matches (according to a given criterion) the characteristics of the block in the current picture; in the specific case of the MPEG compression standard, N, M=16, commonly referred to as a macroblock. However, the present invention is not restricted to this rectangular geometry but assumes that the geometry of the matching block in the reference picture need not be the same as the geometry of the block in the current picture, since objects in the real world undergo scale changes as well as rotation and warping. An efficient way to decode only the actual group of pixels needed for constructing the visual rhythm of such hybrid (intraframe and interframe) coded frames is as follows:

1. Out of the blocks composing a given frame sequence, decode only the blocks needed to decode the blocks containing at least one pixel selected by the predetermined sampling strategy for constructing the visual rhythm; and

2. Obtain the pixel values for constructing the visual rhythm from the decoded blocks.

For example, define three different types of pictures using MPEG-1 terminology. Intra-pictures (I-pictures) are compressed using intraframe coding; that is, they do not reference any other pictures in the coded bit stream. Referring to FIG. 22, predicted pictures (P-pictures) 2204 and 2202 are coded using motion-compensated prediction from a past I-picture 2206 or P-picture 2204, respectively. Bidirectionally predicted pictures (B-pictures) 2210 are coded using motion-compensated prediction from past and/or future I-pictures 2206 or P-pictures 2204 and 2202. Therefore, given a pixel selected by the predetermined sampling strategy for constructing the visual rhythm, one needs only decode the blocks in I-, P-, and B-pictures needed to decode the block containing the corresponding pixel in the current picture.

Many video coding applications restrict the search to a [−p, p−1] region around the original location of the block, due to the computation-intensive operations needed to find an N×M pixel region in the reference picture that best matches (according to a given criterion) the characteristics of the N×M pixel region in the current picture. This implies that one need only decompress the blocks within the [−p, p−1] region around the original location of the blocks containing the pixels to be sampled for constructing the visual rhythm, in pictures that may be referenced by other picture types for motion compensation. For pictures that cannot be referenced by other picture types for motion compensation, one only needs to decompress the blocks containing the pixels sampled for the visual rhythm.

For example, FIG. 23 and FIG. 24 illustrate the shaded blocks that need to be decompressed for the construction of the visual rhythm in frames that can be referenced by other frames for motion compensation and in frames that cannot be so referenced, respectively. For a visual rhythm constructed by sampling the diagonal pixels located on line 2308 of a frame 2302, one only needs to decompress the shaded blocks in FIG. 23, which lie between the lines 2304 and 2310 (separated by value 2306, the search range p of the motion prediction). For frames not referenced by other frames (B-pictures), one simply needs to decompress the blocks located along the diagonal line 2404 of the frame 2402, as illustrated in FIG. 24.
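
A minimal sketch of this block-selection rule follows (Python; the diagonal sampling path, the 8×8 block size, and treating the search range p as a uniform margin around each block are illustrative assumptions):

    def blocks_to_decode(w, h, block=8, p=16, referenced=True):
        # Return the (bx, by) block coordinates that must be decompressed
        # to recover the diagonal sample pixels of one frame. If the frame
        # can be referenced for motion compensation, widen the set by the
        # search range p around each block.
        bw = (w + block - 1) // block
        bh = (h + block - 1) // block
        reach = (p + block - 1) // block if referenced else 0
        needed = set()
        for z in range(h):  # one sample pixel per row of the frame
            bx, by = (z * w // h) // block, z // block
            for dy in range(-reach, reach + 1):
                for dx in range(-reach, reach + 1):
                    nx, ny = bx + dx, by + dy
                    if 0 <= nx < bw and 0 <= ny < bh:
                        needed.add((nx, ny))
        return needed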

Such an approach allows one to obtain the required group of pixels without decoding unnecessary blocks, and guarantees that the pixel values needed for constructing the visual rhythm can be obtained without fully decoding all the blocks composing each frame of the sequence.

For compression schemes using the discrete cosine transform (DCT) for intra-frame coding, like Motion-JPEG and MPEG, or any other transform-domain compression scheme such as the discrete wavelet transform, it is possible to further reduce the time for constructing the visual rhythm. For example, a DCT block of N×N pixels is transformed to a frequency-domain representation resulting in one DC and (N×N−1) AC coefficients. The single DC coefficient is N times the average of all N×N pixel values. This means that the DC coefficient of a DCT block can serve as the pixel value of a pixel included in the block if accurate pixel values are not required. Extraction of a DC coefficient from a DCT block can be performed fast because it does not require fully decoding the DCT block. In the present invention, after identifying the shaded blocks illustrated in FIGS. 23 and 24, the extraction of DC coefficients from the blocks can be used instead of fully decoding the blocks to obtain the values of the pixels selected by the predetermined sampling strategy for constructing the visual rhythm. The same approach can be applied to any given compression scheme by utilizing only the coefficients readily available through compression.
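
For instance (a sketch; the scaling assumes a DCT normalization in which the DC term equals N times the block mean, as stated above):

    def dc_pixel_estimate(dct_block):
        # Approximate every pixel of an N x N DCT block by the block mean,
        # recovered from the DC coefficient alone: mean = DC / N.
        n = len(dct_block)
        return dct_block[0][0] / n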

Fast Text Detection

For the design of an efficient real-time caption text locator, resort is made to using a portion of the original video, called a partial video. The partial video must retain most, if not all, of the caption text information. The visual rhythm, as defined below, satisfies this requirement. Let f_(DC)(x, y, t) be the pixel value at location (x, y) of an arbitrary DC image that consists of the DC coefficients of the original frame t. Using the sequence of DC images of a video, called the DC sequence, the visual rhythm VR of the video V is defined as follows:

    VR = {f_(VR)(z, t)} = {f_(DC)(x(z), y(z), t)},

where x(z) and y(z) are one-dimensional functions of the independent variable z. Thus, the visual rhythm is a two-dimensional image consisting of DC coefficients sampled from three-dimensional data (the DC sequence). Visual rhythm is also an important visual feature that can be utilized to detect scene changes.

The sampling strategies, x(z) and y(z), must be carefully chosen for the visual rhythm to retain caption text information. One sets x(z), y(z) as

    $(x(z), y(z)) = \begin{cases} \left( \frac{W}{H}z,\; z \right), & 0 \leq z < H \\ \left( 2W - \frac{W}{H}z,\; z - H \right), & H \leq z < 2H \\ \left( \frac{W}{2},\; z - 2H \right), & 2H \leq z < 3H \end{cases}$

where W and H are the width and the height of the DC sequence, respectively.
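
In code (a sketch; the integer arithmetic and the clamp at the right edge are implementation choices not dictated by the formula):

    def text_sampling_xy(z, W, H):
        # Piecewise sampling path for 0 <= z < 3H: main diagonal,
        # anti-diagonal, then the central vertical line.
        if z < H:
            return min(W * z // H, W - 1), z
        if z < 2 * H:
            return min(2 * W - W * z // H, W - 1), z - H
        return W // 2, z - 2 * H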

The sampling strategies above are due partially, if not entirely, to empirical observations that portions of caption text generally tend to appear in these particular regions. FIG. 26 illustrates a set of sampling strategies for constructing a visual rhythm from the set of frames making up a video stream. Specifically, the frame sequence 2602 utilizes a single horizontal sampling 2603 across the middle of the frame. Alternatively, the frame sequence 2604 utilizes vertical sampling 2605 from the top to the bottom of the frame, midway between the left and right sides. Finally, the frame sequence 2606 utilizes diagonal sampling 2607 from one corner of the frame to the opposite corner. It will be understood that the scanning techniques noted above can be mixed and matched (e.g., combining vertical and diagonal) and that multiple scans can take place (e.g., multiple horizontal scans, or cross-diagonal scans) to enhance the search, albeit with a potential performance loss due to the extra computational overhead. However, the sampling strategies can be set in a flexible manner for text detection in specific video materials where the approximate regions of caption text are known a priori.

FIG. 27(a) shows an example of a visual rhythm when the diagonals of a frame are sampled. Referring to FIG. 27(c), frame 2714 is one of a set of frames used to construct the binarized visual rhythm 2712, where only the pixels 2718 corresponding to caption text are represented in white. A caption 2716 is embedded in the frame 2714 and in the subsequent set of frames used to construct the binarized visual rhythm 2712, so that a “caption line” 2718 is formed within the binarized visual rhythm 2712. FIG. 27(a) and FIG. 27(b) illustrate the visual rhythm 2702 of video content (FIG. 27(a)) and its corresponding binarized visual rhythm 2708, where pixels corresponding to caption 2710 are represented in white (FIG. 27(b)). Caption text embedded in zone 2706 of the visual rhythm illustrated in FIG. 27(a) shows that captions possess certain properties, such as in region 2704. This region 2704 of FIG. 27(a) can be separated and represented in white 2710, as in FIG. 27(b), to form the binarized visual rhythm 2708. Once the binarized visual rhythm 2708 is obtained, only a portion of the content of the entire frame need be scanned in order to extract the textual information and create appropriate multimedia bookmarks according to the method of the present invention. As illustrated in FIG. 28, the method of the present invention similarly enables one to locate the caption text 2804 of a frame 2802, as well as multiple captions 2808, 2810, and 2812 of another frame 2806, extract the text, and obtain the binarized results 2804′, 2808′, 2810′, and 2812′ for subsequent processing, recognizing text, indexing, storing, and retrieving.

Caption Frame Detection

The caption frame detection stage seeks caption frames, which herein are defined as video or image frames that contain one or more caption texts. The caption frame detection algorithm is based on the following characteristics of caption text within video:

1. Characters in a single caption text tend to have a similar color;

2. Caption text tends to retain its size and font over multiple frames;

3. Caption text is either stationary or moving linearly;

4. Caption text contrasts with its background; and

5. Caption text remains in the scene for a number of consecutive frames.

It is preferable to restrict oneself to locating only stationary caption text, because stationary text is more often an important carrier of information and hence more suitable for indexing and retrieval than moving caption text. Therefore, for purposes of this disclosure, caption text refers to stationary caption text in the remainder of this disclosure.

With the above characteristics of video, one can observe that pixels corresponding to caption text sampled from portions of the DC sequence manifest themselves as long horizontal lines 2704 in high contrast with their background on the visual rhythm 2702. Hence, horizontal lines on the visual rhythm in high contrast with their background are mostly due to caption text, and they provide clues as to when each caption text appears within the video. Thus, the visual rhythm serves as an important visual feature for detecting caption frames.

First of all, to detect caption frames, horizontal edge detection is performed on the visual rhythm using the Prewitt edge operator with the convolution kernel

    $\begin{bmatrix} -1 & -1 & -1 \\ 0 & 0 & 0 \\ 1 & 1 & 1 \end{bmatrix}$

to obtain VR_(edge)(z, t) as follows:

    $VR_{edge}(z, t) = \sum_{i=-1}^{1} \sum_{j=-1}^{1} w_{i,j}\, f_{VR}(z + j, t + i)$
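
A sketch of this step (NumPy/SciPy; the kernel is applied so that the derivative runs along z, the row axis of the visual rhythm, which is what makes horizontal caption lines respond):

    import numpy as np
    from scipy.ndimage import correlate

    # Horizontal-edge Prewitt kernel from above; rows correspond to z.
    PREWITT_H = np.array([[-1, -1, -1],
                          [ 0,  0,  0],
                          [ 1,  1,  1]], dtype=float)

    def vr_edge(vr):
        # vr: visual rhythm as a 2-D array, rows = z, columns = t (frames).
        return correlate(vr.astype(float), PREWITT_H, mode="nearest")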

To obtain caption lines, defined as horizontal lines on the visual rhythm possibly formed by portions of caption text, pixels whose edge values VR_(edge)(z, t) are greater than τ=150 are connected in the horizontal direction. Caption lines with lengths shorter than the frame length corresponding to a specific amount of time are neglected, since caption text usually remains in the scene for a number of consecutive frames. Through several experiments on various types of video materials, the shortest captions appear to be active for at least two seconds, which translates into a caption line with a frame length of 60 if the video is digitized at 30 frames per second. Thus, caption lines with lengths less than 2 seconds can be eliminated. The resulting set of caption lines with their temporal durations appears in the form

    LINE_(k), [z_(k), t_(k)^(start), t_(k)^(end)], k = 1, . . . , N_(LINE),

where [z_(k), t_(k)^(start), t_(k)^(end)] denotes the z coordinate and the beginning and end frames of the occurrence of caption line LINE_(k) on the visual rhythm, respectively, and N_(LINE) is the total number of caption lines. The caption lines are ordered by increasing starting frame number:

    t₁^(start) ≦ t₂^(start) ≦ . . . ≦ t_(N_(LINE))^(start).

FIG. 27(b) shows VR_(Binarized)(z, t), the binarized visual rhythm representing in white 2710 the caption lines possibly formed by caption text from the visual rhythm of FIG. 27(a), where

    $VR_{Binarized}(z, t) = \begin{cases} 1, & z = z_k,\; t_k^{start} \leq t \leq t_k^{end} \\ 0, & \text{elsewhere} \end{cases}$

for k = 1, . . . , N_(LINE).

Frames not within the temporal duration of any caption line in the resulting set can be assumed not to contain any caption text and are thus omitted as caption frame candidates.
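
Pulling the threshold, the horizontal connection, and the minimum-duration rule together, a compact sketch (treating a run of consecutive above-threshold frames at a fixed z as one caption line is one plausible reading of "connected in the horizontal direction"):

    def caption_lines(vr_edge_img, tau=150.0, fps=30, min_sec=2.0):
        # For each row z, find horizontal runs where the edge response
        # exceeds tau; keep runs lasting at least min_sec seconds.
        min_len = int(fps * min_sec)
        lines = []
        height, frames = vr_edge_img.shape
        for z in range(height):
            t = 0
            while t < frames:
                if vr_edge_img[z, t] > tau:
                    start = t
                    while t < frames and vr_edge_img[z, t] > tau:
                        t += 1
                    if t - start >= min_len:
                        lines.append((z, start, t - 1))  # (z_k, t_start, t_end)
                else:
                    t += 1
        return sorted(lines, key=lambda line: line[1])  # by starting frame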

Caption Text Localization

The caption text localization stage seeks to spatially localize caption text within the caption frame, along with its temporal duration within the video.

Let f_(DC)(x, y, t) be the pixel value at (x, y) of the DC image of frame t. Given the sampling strategy defined above for the visual rhythm, a caption line LINE_(k) is formed by a portion of caption text located at (x, y) = (x(z_(k)), y(z_(k))) in the DC sequence between t_(k)^(start) and t_(k)^(end).

Furthermore, if a portion of caption text is located at (x, y) = (x(z_(k)), y(z_(k))) within a DC image, one can expect other portions of the caption text to appear along y = y(z_(k)), because caption text is usually horizontally aligned. Therefore, a caption line can be used to approximate the location of caption text within the frame, and enables the algorithm to focus on a specific area of the frame.

Thus, from the above observations, for each LINE_(k) it is possible to simply segment the caption text region located along y = y(z_(k)) in a DC image between t_(k)^(start) and t_(k)^(end), and assume this segmented region to persist along the temporal duration of the caption line LINE_(k).

To localize caption text candidate regions for caption line LINE_(k), it is preferable to cluster pixels with values f_(VR)(z_(k), t)±δ (where δ=10), starting from the pixels of the horizontal scanline y = y(z_(k)) with value f_(VR)(z_(k), t), using a 4-connected clustering algorithm in the DC image of frame t, where t = (t_(k)^(start)+t_(k)^(end))/2. This is partially because the characters in a single text caption tend to have a similar color and to be horizontally aligned. Each of the clustered regions records the leftmost, rightmost, top, and bottom locations of the pixels that are merged together.

Once the clustered regions have been obtained for LINE_(k), one needs to merge the regions corresponding to portions of a caption text to form a bounding box around the caption text. It is preferable to verify whether each region is formed by caption text based upon heuristics obtained through empirical observations of text across a range of text sources. Because the focus is on finding caption text, a clustered region should have similar clustered regions nearby that belong to the same caption text. Such a heuristic can be described using connectability, which is defined as follows:

    Let A and B be different text candidate regions. A and B are connectable if they are of similar height and horizontally aligned, and there is a path between A and B.

Here, two regions are considered to be of similar height if the height of the shorter region is at least 40% of the height of the taller one. To determine horizontal alignment, the regions are projected onto the Y-axis. If the overlap of the projections of two regions is at least 50% of the shorter one, they are considered to be horizontally aligned. In addition, it is clear that regions corresponding to the same caption text should be close to each other. By empirical observation, the spacing between the characters and words of a caption text is usually less than three times the height of the tallest character, and so is the width of a character in most fonts. Therefore, the following criterion is optionally used to merge regions corresponding to portions of caption text to obtain a bounding box around the caption text:

    Two regions, A and B, are merged if they are connectable and there is a path between A and B whose length is less than 3 times the height of the taller region.

Moreover, the aspect-ratio constraint can be enforced on the final merged regions:

    $\frac{Width}{Height} > \tau_A \quad (\tau_A = 0.7),$

where Width and Height are the width and height of the final caption text region.

The caption text region is expected to meet the above constraint; otherwise, it is removed as a text region. The final caption text region takes the temporal duration of its corresponding caption line.
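
The height, alignment, proximity, and aspect-ratio tests above can be sketched as follows (boxes are (left, top, right, bottom) tuples; interpreting the length of the "path" between two regions as the horizontal gap between their boxes is an assumption):

    def height(box):
        return box[3] - box[1]

    def similar_height(a, b):
        # The shorter region is at least 40% of the height of the taller.
        return min(height(a), height(b)) >= 0.4 * max(height(a), height(b))

    def horizontally_aligned(a, b):
        # Y-projection overlap is at least 50% of the shorter region.
        overlap = min(a[3], b[3]) - max(a[1], b[1])
        return overlap >= 0.5 * min(height(a), height(b))

    def mergeable(a, b):
        # Connectable, with a gap of less than 3x the taller height.
        gap = max(a[0], b[0]) - min(a[2], b[2])
        return (similar_height(a, b) and horizontally_aligned(a, b)
                and gap < 3 * max(height(a), height(b)))

    def keep_as_text(box, tau_a=0.7):
        # Final aspect-ratio constraint: Width / Height > tau_A.
        return (box[2] - box[0]) / height(box) > tau_a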

The above procedure is iterated to obtain a bounding box around the caption text for each caption line LINE_(k), in increasing order of k (k = 1, . . . , N_(LINE)). However, since several caption lines are usually formed by the same caption text, the caption text localization process is omitted for a caption line LINE_(k) if any caption text region obtained beforehand lies on the horizontal scanline y = y(z_(k)). The usefulness of this text region extraction step is that it is inexpensive and fast, robustly supplying bounding boxes around caption text along with their temporal information.

The present invention, therefore, is well adapted to carry out the objects and attain both the ends and the advantages mentioned, as well as other benefits inherent therein. While the present invention has been depicted, described, and is defined by reference to particular embodiments of the invention, such references do not imply a limitation on the invention, and no such limitation is to be inferred. The invention is capable of considerable modification, alteration, and equivalents in form and/or function, as will occur to those of ordinary skill in the pertinent arts. The depicted and described embodiments of the invention are exemplary only, and are not exhaustive of the scope of the invention. Consequently, the invention is intended to be limited only by the spirit and scope of the appended claims, giving full cognizance to equivalents in all respects.

1-45. (canceled) 46-66. (canceled)
67. A method for sending a multimedia bookmark between devices over a wireless network, the method comprising: submitting a multimedia bookmark to a video bookmark message service center by a sending device; acknowledging receipt of the multimedia bookmark by the video bookmark message service center to the sending device; requesting routing information for a recipient device from a home location register by the video bookmark message service center; receiving the routing information from the home location register by the video bookmark message service center; invoking a send multimedia bookmark at a mobile switching center; sending the multimedia bookmark to a recipient device by the mobile switching center; acknowledging receipt of the multimedia bookmark by the recipient device; and notifying the video bookmark message service center when the multimedia bookmark has been received by the recipient device.
68. The method of claim 67 further comprising the sending and recipient devices including wireless devices. 69-83. (canceled)